Contents
Chapter 1 Introduction
1.1 Anatomy of an XML Document
1.2 Differences Between XML and Relational Data
1.3 Overview of DB2 pureXML
1.4 Benefits of DB2 pureXML over Alternative Storage Options for XML Data
1.5 XML Solutions to Relational Data Model Problems
1.5.1 When the Schema Is Volatile
1.5.2 When Data Is Inherently Hierarchical in Nature
1.5.3 When Data Represents Business Objects
1.5.4 When Objects Have Sparse Attributes
1.5.5 When Data Needs to be Exchanged
1.6 Summary
Chapter 2 Designing XML Data and Applications
2.1 Choosing Between XML Elements and XML Attributes
2.2 XML Tags versus Values
2.3 Choosing the Right Document Granularity
2.4 Using a Hybrid XML/Relational Approach
2.5 Summary
Chapter 3 Designing and Managing XML Storage Objects
3.1 Understanding XML Document Trees
3.2 Understanding pureXML Storage
3.3 XML Storage in DB2 for Linux, UNIX, and Windows
3.3.1 Storage Objects for XML Data
3.3.2 Defining Columns, Tables, and Table Spaces for XML Data
3.3.3 Dropping XML Columns
3.3.4 Improved XML Storage Format in DB2 9.7
3.4 Using XML Base Table Row Storage (Inlining)
3.4.1 Monitoring and Configuring XML Inlining
3.4.2 Potential Benefits and Drawbacks of XML Inlining
3.5 Compressing XML Data
3.6 Examining XML Storage Space Consumption
3.7 Reorganizing XML Data and Indexes
3.8 Understanding XML Space Management: A Comprehensive Example
3.9 XML in Range Partitioned Tables and MDC Tables
3.9.1 XML and Range Partitioning
3.9.2 XML and Multidimensional Clustering
3.10 XML in a Partitioned Database (DPF)
3.11 XML Storage in DB2 for z/OS
3.11.1 Storage Objects for XML Data
3.11.2 Characteristics of XML Table Spaces
3.11.3 Tables with Multiple XML Columns
3.11.4 Naming and Storage Conventions
3.12 Utilities for XML Objects in DB2 for z/OS
3.12.1 REPORT TABLESPACESET for XML
3.12.2 Reorganizing XML Data in DB2 for z/OS
3.12.3 CHECK DATA for XML
3.13 XML Parsing and Memory Consumption in DB2 for z/OS
3.13.1 Controlling the Memory Consumption of XML Operations
3.13.2 Redirecting XML Parsing to zIIP and zAAP
3.14 Summary
Chapter 4 Inserting and Retrieving XML Data
4.1 Inserting XML Documents
4.1.1 Simple Insert Statements
4.1.2 Reading XML Documents from Files or URLs
4.2 Deleting XML Documents
4.3 Retrieving XML Documents
4.4 Handling Documents with XML Declarations
4.5 Copying Full XML Documents
4.6 Dealing with XML Special Characters
4.7 Understanding XML Whitespace and Document Storage
4.7.1 Preserving XML Whitespace
4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”
4.7.3 Storing XML Documents for Compliance
4.8 Summary
Chapter 5 Moving XML Data
5.1 Exporting XML Data in DB2 for Linux, UNIX, and Windows
5.1.1 Exporting XML Documents to a Single File
5.1.2 Exporting XML Documents as Individual Files
5.1.3 Exporting XML Documents as Individual Files with Non-Default Names
5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories
5.1.5 Exporting Fragments of XML Documents
5.1.6 Exporting XML Data with XML Schema Information
5.2 Importing XML Data in DB2 for Linux, UNIX, and Windows
5.2.1 IMPORT Command and Input Files
5.2.2 Import/Insert Performance Tips
5.3 Loading XML Data in DB2 for Linux, UNIX, and Windows
5.4 Unloading XML Data in DB2 for z/OS
5.5 Loading XML Data in DB2 for z/OS
5.6 Validating XML Documents during Load and Insert Operations
5.7 Splitting Large XML Documents into Smaller Documents
5.8 Replicating and Publishing XML Data
5.9 Federating XML Data
5.10 Managing XML Data with HADR
5.11 Handling XML Data in db2look and db2move
5.12 Summary
Chapter 6 Querying XML Data
6.1 An Overview of Querying XML Data
6.2 Understanding the XQuery and XPath Data Model
6.2.1 Sequences
6.2.2 Sequence in, Sequence out
6.3 Sample Data for XPath, SQL/XML, and XQuery
6.4 Introduction to XPath
6.4.1 Analogy Between XPath and Navigating a File System
6.4.2 Simple XPath Queries
6.5 How to Execute XPath in DB2
6.6 Wildcards and Double Slashes
6.7 XPath Predicates
6.8 Existential Semantics
6.9 Logical Expressions with and, or, not()
6.10 The Current Context and the Parent Step
6.11 Positional Predicates
6.12 Union and Construction of Sequences
6.13 XPath Functions
6.14 General and Value Comparisons
6.15 XPath Axes and Unabbreviated Syntax
6.16 Summary
Chapter 7 Querying XML Data with SQL/XML
7.1 Overview of SQL/XML
7.2 Retrieving XML Documents or Document Fragments with XMLQUERY
7.2.1 Referencing XML Columns in SQL/XML Functions
7.2.2 Retrieving Element Values Without XML Tags
7.2.3 Retrieving Repeating Elements with XMLQUERY
7.3 Retrieving XML Values in Relational Format with XMLTABLE
7.3.1 Generating Rows and Columns from XML Data
7.3.2 Dealing with Missing Elements
7.3.3 Avoiding Type Errors
7.3.4 Retrieving Repeating Elements with XMLTABLE
7.3.5 Numbering XMLTABLE Rows Based on Repeating Elements
7.3.6 Retrieving Multiple Repeating Elements at Different Levels
7.4 Using XPath Predicates in SQL/XML with XMLEXISTS
7.5 Common Mistakes with SQL/XML Predicates
7.6 Using Parameter Markers or Host Variables
7.7 XML Queries with Dynamically Computed XPath Expressions
7.8 Ordering a Query Result Set Based on XML Values
7.9 Converting XML Values to Binary SQL Types
7.10 Summary
Chapter 8 Querying XML Data with XQuery
8.1 XQuery Overview
8.2 Processing XML Data with FLWOR Expressions
8.2.1 Anatomy of a FLWOR Expression
8.2.2 Understanding the for and let Clauses
8.2.3 Understanding the where and order by Clauses
8.2.4 FLWOR Expressions with Multiple for and let Clauses
8.3 Comparing FLWOR Expressions, XPath Expressions, and SQL/XML
8.3.1 Traversing XML Documents
8.3.2 Using XML Predicates
8.3.3 Result Set Cardinalities in XQuery and SQL/XML
8.3.4 Using FLWOR Expressions in SQL/XML
8.4 Constructing XML Data
8.4.1 Constructing Elements with Computed Values
8.4.2 Constructing XML Data with Predicates and Conditions
8.4.3 Constructing Documents with Multiple Levels of Nesting
8.4.4 Constructing Documents with XML Aggregation in SQL/XML Queries
8.5 Data Types, Cast Expressions, and Type Errors
8.6 Arithmetic Expressions
8.7 XQuery Functions
8.7.1 String Functions
8.7.2 Number and Aggregation Functions
8.7.3 Sequence Functions
8.7.4 Namespace and Node Functions
8.7.5 Date and Time Functions
8.7.6 Boolean Functions
8.8 Embedding SQL in XQuery
8.9 Using SQL Functions and User-Defined Functions in XQuery
8.10 Summary
Chapter 9 Querying XML Data
9.1 Aggregation and Grouping of XML Data
9.1.1 Aggregation and Grouping Queries with XMLTABLE
9.1.2 Aggregation of Values within and across XML Documents
9.1.3 Grouping Queries in SQL/XML versus XQuery
9.2 Join Queries with XML Data
9.2.1 XQuery Joins between XML Columns
9.2.2 SQL/XML Joins between XML Columns
9.2.3 Joins between XML and Relational Columns
9.2.4 Outer Joins between XML Columns
9.3 Case-Insensitive XML Queries
9.4 How to Avoid “Bad” Queries
9.4.1 Construction of Excessively Large Documents
9.4.2 “Between” Predicates on XML Data
9.4.3 Large Global Sequences
9.4.4 Multilevel Nesting SQL and XQuery
9.5 Common Errors and How to Avoid Them
9.5.1 SQL16001N
9.5.2 SQL16002N
9.5.3 SQL16003N
9.5.4 SQL16005N
9.5.5 SQL16015N
9.5.6 SQL16011N
9.5.7 SQL16061N
9.5.8 SQL16075N
9.6 Summary
Chapter 10 Producing XML from Relational Data
10.1 SQL/XML Publishing Functions
10.1.1 Constructing XML Elements from Relational Data
10.1.2 NULL Values, Missing Elements, and Empty Elements
10.1.3 Constructing XML Attributes from Relational Data
10.1.4 Constructing XML Documents from Multiple Relational Rows
10.1.5 Constructing XML Documents from Multiple Relational Tables
10.1.6 Comparing XMLAGG, XMLCONCAT, and XMLFOREST
10.1.7 Conditional Element Construction
10.1.8 Leading Zeros in Constructed Elements and Attributes
10.1.9 Default Tagging of Relational Data with XMLROW and XMLGROUP
10.1.10 GUI-Based Definition of SQL/XML Publishing Queries
10.1.11 Constructing Comments, Processing Instructions, and Text Nodes
10.1.12 Legacy Functions
10.2 Using XQuery Constructors with Relational Input
10.3 XML Declarations for Constructed XML Data
10.4 Inserting Constructed XML Data into XML Columns
10.5 Summary
Chapter 11 Converting XML to Relational Data
11.1 Advantages and Disadvantages of Shredding
11.2 Shredding with the XMLTABLE Function
11.2.1 Hybrid XML Storage
11.2.2 Relational Views over XML Data
11.3 Shredding with Annotated XML Schemas
11.3.1 Annotating an XML Schema
11.3.2 Defining Schema Annotations Visually in IBM Data Studio
11.3.3 Registering an Annotated Schema
11.3.4 Decomposing One XML Document at a Time
11.3.5 Decomposing XML Documents in Bulk
11.4 Summary
Chapter 12 Updating and Transforming XML Documents
12.1 Replacing a Full XML Document
12.2 Modifying Documents with XQuery Updates
12.3 Updating the Value of an XML Node in a Document
12.3.1 Replacing an Element Value
12.3.2 Replacing an Attribute Value
12.3.3 Replacing a Value Using a Parameter Marker
12.3.4 Replacing Multiple Values in a Document
12.3.5 Replacing an Existing Value with a Computed Value
12.4 Replacing XML Nodes in a Document
12.5 Deleting XML Nodes from a Document
12.6 Renaming Elements or Attributes in a Document
12.7 Inserting XML Nodes into a Document
12.7.1 Defining the Position of Inserted Elements
12.7.2 Defining the Position of Inserted Attributes
12.7.3 Insert Examples
12.8 Handling Repeating and Missing Nodes
12.9 Modifying Multiple XML Nodes in the Same Document
12.9.1 Snapshot Semantics and Conflict Situations
12.9.2 Converting Elements to Attributes and Vice Versa
12.10 Modifying XML Documents in Queries
12.11 Modifying XML Documents in Insert Operations
12.12 Modifying XML Documents in Update Cursors
12.13 XML Updates in DB2 for z/OS
12.14 Transforming XML Documents with XSLT
12.14.1 The XSLTRANSFORM Function
12.14.2 XML to HTML Transformation
12.15 Summary
Chapter 13 Defining and Using XML Indexes
13.1 Defining XML Indexes
13.1.1 Unique XML Indexes
13.1.2 Lean XML Indexes
13.1.3 Using the DB2 Control Center to Create XML Indexes
13.2 XML Index Data Types
13.2.1 VARCHAR(n)
13.2.2 VARCHAR HASHED
13.2.3 DOUBLE and DECFLOAT
13.2.4 DATE and TIMESTAMP
13.2.5 Choosing a Suitable Index Data Type
13.2.6 Rejecting Invalid Values
13.3 Using XML Indexes to Evaluate Query Predicates
13.3.1 Understanding Index Eligibility
13.3.2 Data Types in XML Indexes and Query Predicates
13.3.3 Text Nodes in XML Indexes and Query Predicates
13.3.4 Wildcards in XML Indexes and Query Predicates
13.3.5 Using Indexes for Structural Predicates
13.4 XML Indexes and Join Predicates
13.5 XML Indexes on Non-Leaf Elements
13.6 Special Cases Where XML Indexes Cannot be Used
13.6.1 Special Cases with XMLQUERY
13.6.2 Parent Steps
13.6.3 The let and return Clauses
13.7 XML Index Internals
13.7.1 XML Index Keys
13.7.2 Logical and Physical XML Indexes
13.8 XML Index Statistics
13.9 Summary
Chapter 14 XML Performance and Monitoring
14.1 Explaining XML Queries in DB2 for Linux, UNIX, and Windows
14.1.1 The Explain Tables in DB2 for Linux, UNIX, and Windows
14.1.2 Using db2exfmt to Obtain Access Plans
14.1.3 Using Visual Explain to Display Access Plans
14.1.4 Access Plan Operators
14.1.5 Understanding and Analyzing XML Query Execution Plans
14.2 Explaining XML Queries in DB2 for z/OS
14.2.1 The Explain Tables in DB2 for z/OS
14.2.2 Obtaining Access Plan Information in SPUFI
14.2.3 Using Visual Explain to Display Access Plans
14.2.4 Access Plan Operators
14.2.5 Understanding and Analyzing XML Query Execution Plans
14.3 Statistics Collection for XML Data
14.3.1 Statistics Collection for XML Data in DB2 for z/OS
14.3.2 Statistics Collection for XML Data in DB2 for Linux, UNIX, and Windows
14.3.3 Examining XML Statistics with db2cat
14.4 Monitoring XML Activity
14.4.1 Using the Snapshot Monitor in DB2 for Linux, UNIX, and Windows
14.4.2 Monitoring Database Utilities
14.5 Best Practices for XML Performance
14.5.1 XML Document Design
14.5.2 XML Storage
14.5.3 XML Queries
14.5.4 XML Indexes
14.5.5 XML Updates
14.5.6 XML Schemas
14.5.7 XML Applications
14.6 Summary
Chapter 15 Managing XML Data with Namespaces
15.1 Introduction to XML Namespaces
15.1.1 Namespace Declarations in XML Documents
15.1.2 Default Namespaces
15.2 Exploring Namespaces in XML Documents
15.3 Querying XML Data with Namespaces
15.3.1 Declaring Namespaces in XML Queries
15.3.2 Using Namespace Declarations in SQL/XML Queries
15.3.3 Using Namespaces in the XMLTABLE Function
15.3.4 Dealing with Multiple Namespaces per Document
15.4 Creating Indexes for XML Data with Namespaces
15.5 Constructing XML Data with Namespaces
15.5.1 SQL/XML Publishing Functions and Namespaces
15.5.2 XQuery Constructors and Namespaces
15.6 Updating XML Data with Namespaces
15.6.1 Updating Values in Documents with Namespaces
15.6.2 Renaming Nodes in Documents with Namespace Prefixes
15.6.3 Renaming Nodes in Documents with Default Namespaces
15.6.4 Inserting and Replacing Nodes in Documents with Namespaces
15.7 Summary
Chapter 16 Managing XML Schemas
16.1 Introduction to XML Schemas and Their Usage
16.1.1 Valid Versus Well-Formed XML Documents
16.1.2 To Validate or Not to Validate, That Is the Question!
16.1.3 Custom Versus Industry Standard XML Schemas
16.2 Anatomy of an XML Schema
16.3 An XML Schema with Include and Import
16.4 Registering XML Schemas
16.4.1 Registering XML Schemas in the DB2 Command Line Processor
16.4.2 Registering XML Schemas from Applications via Stored Procedures
16.4.3 Registering XML Schemas from Java Applications via JDBC
16.4.4 Two XML Schemas Sharing a Common Schema Document
16.4.5 Error Situations and How to Resolve Them
16.5 Removing XML Schemas from the Schema Repository
16.6 XML Schema Evolution
16.6.1 Schema Evolution Without Document Validation
16.6.2 Generic Schema Evolution with Document Validation
16.6.3 Compatible Schema Evolution with the UPDATE XMLSCHEMA Command
16.7 Granting and Revoking XML Schema Usage Privileges
16.8 Document Type Definitions (DTDs) and External Entities
16.9 Browsing the XML Schema Repository (XSR)
16.9.1 Tables and Views of the XML Schema Repository
16.9.2 Queries against the XML Schema Repository
16.10 XML Schema Considerations in DB2 for z/OS
16.11 Summary
Chapter 17 Validating XML Documents against XML Schemas
17.1 Document Validation Upon Insert
17.2 Document Validation Upon Update
17.3 Validation without Rejecting Invalid Documents
17.4 Enforcing Validation with Check Constraints
17.5 Automatic Validation with Triggers
17.6 Diagnosing Validation and Parsing Errors
17.7 Validation during Load and Import Operations
17.7.1 Validation against a Single XML Schema
17.7.2 Validation against Multiple XML Schemas
17.7.3 Using a Default XML Schema
17.7.4 Overriding XML Schema References
17.7.5 Validation Based on schemaLocation Attributes
17.8 Checking Whether an Existing Document Has Been Validated
17.9 Validating Existing Documents in a Table
17.10 Finding the XML Schema for a Validated Document
17.11 How to Undo Document Validation
17.12 Considerations for Validation in DB2 for z/OS
17.12.1 Document Validation Upon Insert
17.12.2 Document Validation Upon Update
17.12.3 Validating Existing Documents in a Table
17.12.4 Summary of Platform Similarities and Differences
17.13 Summary
Chapter 18 Using XML in Stored Procedures, UDFs, and Triggers
18.1 Manipulating XML in SQL Stored Procedures
18.1.1 Basic XML Manipulation in Stored Procedures
18.1.2 A Stored Procedure to Store XML in a Hybrid Manner
18.1.3 Loops and Cursors
18.1.4 A Stored Procedure to Update a Selected XML Element or Attribute
18.1.5 Three Tips for Testing Stored Procedures
18.2 Manipulating XML in User-Defined Functions
18.2.1 A UDF to Extract an Element or Attribute Value
18.2.2 A UDF to Extract the Values of a Repeating Element
18.2.3 A UDF to Shred XML Data to a Relational Table
18.2.4 A UDF to Modify an XML Document
18.3 Manipulating XML Data with Triggers
18.3.1 Insert Triggers on Tables with XML Columns
18.3.2 Delete Triggers on Tables with XML Columns
18.3.3 Update Triggers on XML Columns
18.4 Summary
Chapter 19 Performing Full-Text Search
19.1 Overview of Text Search in DB2
19.2 Sample Table and Data
19.3 Enabling a Database for the DB2 Net Search Extender
19.4 Managing Full-Text Indexes with the DB2 Net Search Extender
19.4.1 Creating Basic Text Indexes
19.4.2 Creating Text Indexes with Specific Storage Paths
19.4.3 Creating Text Indexes with a Periodic Update Schedule
19.4.4 Creating Text Indexes for Specific Parts of Each Document
19.4.5 Creating Text Indexes with Advanced Options
19.4.6 Updating and Reorganizing Text Indexes
19.4.7 Altering Text Indexes
19.5 Performing XML Full-Text Search with the DB2 Net Search Extender
19.5.1 Full-Text Search in SQL and XQuery
19.5.2 Full-Text Search with Boolean Operators
19.5.3 Full-Text Search with Custom Document Models
19.5.4 Advanced Search with Proximity, Fuzzy, and Stemming Options
19.5.5 Finding the Correct Match within an XML Document
19.5.6 Search Conditions on Sibling Branches of an XML Document
19.5.7 Text Search in the Presence of Namespaces
19.6 DB2 Text Search
19.6.1 Enabling a Database for DB2 Text Search
19.6.2 Creating and Maintaining Full-Text Indexes for DB2 Text Search
19.6.3 Writing DB2 Text Search Queries for XML Data
19.6.4 Full-Text Search with XPath Expressions
19.6.5 Full-Text Search with Wildcards
19.7 Summary of Text Search Administration Commands
19.8 XML Full-Text Search in DB2 for z/OS
19.9 Summary
Chapter 20 Understanding XML Data Encoding
20.1 Understanding Internal and External XML Encoding
20.1.1 Internally Encoded XML Data
20.1.2 Externally Encoded XML Data
20.2 Avoiding Code Page Conversions
20.3 Using Non-Unicode Databases for XML
20.4 Examples of Code Page Issues
20.4.1 Example 1
20.4.2 Example 2
20.4.3 Example 3
20.4.4 Example 4
20.4.5 Example 5
20.5 Avoiding Data Loss and Encoding Errors in Non-Unicode Databases
20.6 Summary
Chapter 21 Developing XML Applications with DB2
21.1 The Value of DB2 pureXML for Application Development
21.1.1 Avoid XML Parsing in the Application Layer
21.1.2 Storing Business Objects in an Intuitive Format
21.1.3 Rapid Prototyping
21.1.4 Responding Quickly to Changing Business Needs
21.2 Using Parameter Markers or Host Variables
21.3 Java Applications
21.3.1 XML Support in JDBC 3.0
21.3.2 XML Support in JDBC 4.0
21.3.3 Comprehensive Example of Manipulating XML Data with JDBC 4.0
21.3.4 Creating XML Documents from Application Data
21.3.5 Binding XML Data to Java Objects
21.3.6 IBM pureQuery
21.4 .NET Applications
21.4.1 Querying XML Data in .NET Applications
21.4.2 Manipulating XML Data in .NET Applications
21.4.3 Inserting XML Data from .NET Applications
21.4.4 XML Schema and DTD Handling in .NET Applications
21.5 CLI Applications
21.6 Embedded SQL Applications
21.6.1 COBOL Applications with Embedded SQL
21.6.2 PL/1 Applications with Embedded SQL
21.6.3 C Applications with Embedded SQL
21.7 PHP Applications
21.8 Perl Applications
21.9 XML Application Development Tools
21.9.1 IBM Data Studio Developer
21.9.2 IBM Database Add-ins for Visual Studio
21.9.3 Altova XML Tools
21.9.4 <oXygen/>
21.9.5 Stylus Studio
21.10 Summary
Chapter 22 Exploring XML Information in the DB2 Catalog
22.1 XML-Related Catalog Information in DB2 for Linux, UNIX, and Windows
22.1.1 Catalog Information for XML Columns
22.1.2 The XML Strings and Paths Tables
22.1.3 The Internal XML Regions and Path Indexes
22.1.4 Catalog Information for User-Defined XML Indexes
22.1.5 Catalog Information for XML Schemas
22.2 XML-Related Catalog Information in DB2 for z/OS
22.2.1 Catalog Information for XML Storage Objects
22.2.2 Catalog Information for XML Indexes
22.2.3 Catalog Information for XML Schemas
22.3 Summary
Chapter 23 Test Your Knowledge—The DB2 pureXML Quiz
23.1 Designing XML Data and Applications
23.2 Designing and Managing Storage Objects for XML
23.3 Inserting and Retrieving XML Data
23.4 Moving XML Data
23.5 Querying XML
23.6 Producing XML from Relational Data
23.7 Converting XML to Relational Data
23.8 Updating and Transforming XML Documents
23.9 Defining and Using XML Indexes
23.10 XML Performance and Monitoring
23.11 Managing XML Data with Namespaces
23.12 XML Schemas and Validation
23.13 Performing Full-Text Search
23.14 XML Application Development
23.15 Answers
Appendix A
A.1 Exploring the Structure of XML Documents
A.1.1 Exploring XML Documents in the DB2 Control Center
A.1.2 Exploring XML Documents in the CLP
A.1.3 Exploring XML Documents in SPUFI
A.2 Tips for Running XML Operations in the CLP
Appendix B
B.1 XML Sample Database on DB2 for Linux, UNIX, and Windows
B.2 XML Sample Tables on DB2 for z/OS
B.3 Table customer—Column info
B.4 Table product—Column description
B.5 Table purchaseorder—Column porder
Appendix C
C.1 General Resources for All Chapters
C.2 Chapter-Specific Resources
C.3 Resources on the Integration of DB2 pureXML with Other Products
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y–Z

Authors: Nicola M., Kumar-Chatterjee P.

Tags: programming, data analysis

ISBN: 978-0-13-815047-1

Year: 2010

Text
                    
DB2® pureXML® Cookbook
Master the Power of the IBM® Hybrid Data Server

Matthias Nicola
Pav Kumar-Chatterjee

IBM Press
Pearson plc
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Cape Town • Sydney • Tokyo • Singapore • Mexico City
Ibmpressbooks.com
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

Before you use any IBM or non-IBM or open-source product mentioned in this book, make sure that you accept and adhere to the licenses and terms and conditions for any such product.

© Copyright 2010 by International Business Machines Corporation. All rights reserved.

Note to U.S. Government Users: Documentation related to restricted right. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation.

IBM Press Program Managers: Steven M. Stansel, Ellice Uffer
Cover design: IBM Corporation
Associate Publisher: Greg Wiegand
Marketing Manager: Kourtnaye Sturgeon
Publicist: Heather Fox
Acquisitions Editor: Bernard Goodwin
Managing Editor: Kristy Hart
Designer: Alan Clements
Project Editor: Andy Beaster
Copy Editor: Paula Lowell
Senior Indexer: Cheryl Lenser
Compositor: Gloria Schurick
Proofreader: Leslie Joseph
Manufacturing Buyer: Dan Uhrig

Published by Pearson plc
Publishing as IBM Press

IBM Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:
U.S. Corporate and Government Sales, 1-800-382-3419, corpsales@pearsontechgroup.com.
For sales outside the U.S., please contact:
International Sales, international@pearson.com.
The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, the IBM logo, IBM Press, DB2, pureXML, z/OS, ibm.com, WebSphere, System z, developerWorks, InfoSphere, DRDA, Rational, AIX, OmniFind, i5/OS, Lotus, and DataPower. Microsoft, Windows, Microsoft Word, Microsoft Visual Studio, Visual Basic, and Visual C# are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc., in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
Library of Congress Cataloging-in-Publication Data Nicola, Matthias. DB2 PureXML cookbook : master the power of IBM’s hybrid data server / Matthias Nicola and Pav Kumar-Chatterjee. p. cm. Includes indexes. ISBN-13: 978-0-13-815047-1 (hardback : alk. paper) ISBN-10: 0-13-815047-8 (hardback : alk. paper) 1. IBM Database 2. 2. XML (Document markup language) 3. Database management. I. Kumar-Chatterjee, Pav. II. Title. QA76.9.D3N525 2009 006.7’4—dc22 2009020222 All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax (617) 671 3447 ISBN-13: 978-0-13-815047-1 ISBN-10: 0-13-815047-8 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing August 2009
I would like to dedicate this book to Scott and Carrie in the hope that it will inspire them to work hard at school and to my mother who did not see the final version, but who gave me unconditional support as only a mother can. —Pav Kumar-Chatterjee
Contents Chapter1 Introduction 1 1.1 1.2 1.3 1.4 1.5 Anatomy of an XML Document Differences Between XML and Relational Data Overview of DB2 pureXML Benefits of DB2 pureXML over Alternative Storage Options for XML Data XML Solutions to Relational Data Model Problems 1.5.1 When the Schema Is Volatile 1.5.2 When Data Is Inherently Hierarchical in Nature 1.5.3 When Data Represents Business Objects 1.5.4 When Objects Have Sparse Attributes 1.5.5 When Data Needs to be Exchanged 1.6 Summary Chapter 2 2.1 2.2 2.3 2.4 2.5 Designing XML Data and Applications Choosing Between XML Elements and XML Attributes XML Tags versus Values Choosing the Right Document Granularity Using a Hybrid XML/Relational Approach Summary Chapter 3 Designing and Managing XML Storage Objects 3.1 Understanding XML Document Trees 3.2 Understanding pureXML Storage 3.3 XML Storage in DB2 for Linux, UNIX, and Windows 3.3.1 Storage Objects for XML Data 3.3.2 Defining Columns,Tables, and Table Spaces for XML Data 3.3.3 Dropping XML Columns 3.3.4 Improved XML Storage Format in DB2 9.7 3.4 Using XML Base Table Row Storage (Inlining) 3.4.1 Monitoring and Configuring XML Inlining 3.4.2 Potential Benefits and Drawbacks of XML Inlining 3.5 Compressing XML Data 3.6 Examining XML Storage Space Consumption 3.7 Reorganizing XML Data and Indexes 3.8 Understanding XML Space Management: A Comprehensive Example 3.9 XML in Range Partitioned Tables and MDC Tables 3.9.1 XML and Range Partitioning 3.9.2 XML and Multidimensional Clustering 3.10 XML in a Partitioned Database (DPF) 3.11 XML Storage in DB2 for z/OS xi 2 4 7 10 11 12 12 12 13 13 13 15 15 19 22 24 25 27 28 30 33 33 36 40 40 41 43 47 48 51 53 54 57 57 58 59 60
Foreword
In the years since E.F. Codd’s groundbreaking work in the 1970s, relational database systems have become ubiquitous in the business world. Today, most of the world’s business data is stored in the rows and columns of relational databases. The relational model is ideally suited to applications in which data has a relatively simple and uniform structure, and in which database structure evolves much more slowly than data values.
With the advent of the Web, however, big changes began to occur in the database world, driven by globalization and by dramatic reductions in the cost of storing, transmitting, and processing data. Today, businesses are globally interconnected and exchange large volumes of data with customers, suppliers, and governments. Much of this data consists of things that do not fit neatly into rows and columns, such as medical records, legal documents, incident reports, tax returns, and purchase orders. The new kinds of data tend to be more heterogeneous than traditional business data, having more variation and a more rapidly evolving structure.
In response to the changing requirements of business data, a new generation of standards have appeared. XML has emerged as an international standard for the exchange of self-describing data, unifying structured, unstructured, and semi-structured information formats. XML Schema has been adopted as the metadata syntax for describing the structure of XML documents. Industry-specific XML schemas have been developed for medical, insurance, retail, publishing, banking, and other industries. XPath and XQuery have been adopted as standard languages for retrieving and manipulating data in XML format, and new facilities have been added to the SQL standard for interfacing between relational and XML data.
In DB2, the new generation of XML-related standards is reflected in pureXML, a broad new set of XML functionality implemented in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
pureXML bridges the gap between the XML and relational worlds and makes DB2 a true hybrid database management system. DB2 pureXML stores and indexes XML data alongside relational data in a highly efficient new storage format, and supports XML query languages such as XPath and XQuery alongside the traditional SQL.
pureXML is perhaps the largest new package of functionality in the history of DB2, impacting nearly every aspect of the system. The implementation of pureXML required deep changes in the database kernel, optimization methods, database administrator tools, system utilities, and application programming interfaces. New facilities were added for registering XML schemas and using them to validate stored documents. New kinds of statistics on XML documents had to be gathered and exploited. Facilities for replicated, federated, and partitioned databases had to be updated to accommodate the new XML storage format.
pureXML provides DB2 users with a new level of capability, but using this capability to full advantage requires users to have a new level of sophistication. A new user of pureXML is confronted with many complex choices. What kinds of data should be represented in XML rather than in normalized tables? How can data be converted between XML and relational formats? How can a hybrid database be designed to take advantage of both data formats? What are the most appropriate uses for SQL, XQuery, and XPath? What kinds of indexes should be maintained on XML data? What is the XML equivalent of a NULL value? These and many other questions are considered in detail in the DB2 pureXML Cookbook.
Matthias Nicola has been deeply involved in the design and implementation of DB2 pureXML since its inception. As a Senior Engineer at IBM’s Silicon Valley Laboratory, his work has focused on measuring and optimizing the performance of new storage and indexing techniques for XML. After the release of pureXML, he worked with many IBM customers and business partners to create, deploy, and optimize XML applications for government, banking, telecommunications, retail, and other industries. Pav Kumar-Chatterjee is a technical specialist with many years of experience in consulting with IBM customers throughout the UK and Europe on developing and deploying DB2 and XML solutions.
Through their work with customers, Matthias and Pav have learned how to explain concepts clearly and how to identify and avoid common pitfalls in the application development process. They have also developed a set of “best practices” that they have shared at numerous conferences, classes, workshops, and customer engagements.
Between them, Matthias and Pav have accumulated all the knowledge and experience you need to successfully create and deploy solutions using DB2 pureXML. Their expertise is encapsulated in this book in the form of hundreds of practical examples, tested and clearly explained. The book also includes a comprehensive set of questions to test your understanding.
DB2 pureXML Cookbook includes both an introduction to basic XML concepts and a comprehensive description of the XML-related features of DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Chapters are organized around tasks that reflect the lifecycle of XML projects, including designing databases, loading and validating data, writing queries and updates, developing applications, optimizing performance, and diagnosing problems. Each topic provides a clear progression from introductory material to more advanced concepts. The writing style is informal and easy to understand for both beginners and experts.
If you are an application developer, database administrator, or system architect, this is the book you need to gain a comprehensive understanding of DB2 pureXML.
Don Chamberlin
IBM Fellow, Emeritus
Almaden Research Center
April 10, 2009
Preface
In recent years XML has continued to emerge as the de facto standard for data exchange, because it is flexible, extensible, self-describing, and suitable for any combination of structured and unstructured data. With the increasing use of XML as a pervasive data format, there is a growing need to store, index, query, update, and validate XML documents in database systems. In response to this demand, IBM has developed sophisticated XML data management capabilities that are deeply integrated in the DB2 database system. This novel technology is called DB2 pureXML and is available in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. With pureXML, DB2 has evolved into a hybrid database system that allows you to manage both XML and relational data in a tightly integrated manner.
The DB2 pureXML Cookbook provides the single most comprehensive coverage of DB2’s pureXML functionality in DB2 for Linux, UNIX, and Windows as well as DB2 for z/OS. This book is a “cookbook” because it is more than just a description of functions and features (“ingredients”). This book provides “recipes” that show you how to combine the pureXML ingredients to efficiently perform typical user tasks for managing XML data. This book explains DB2 pureXML in more than 700 practical examples, including 250+ XQuery and SQL/XML queries, taking you from simple introductions all the way to advanced scenarios, tuning, and troubleshooting.
Since the first release of DB2 pureXML in 2006 we have worked with numerous companies to help them design, implement, optimize, and deploy XML applications with DB2. In this book we have distilled our experience from these pureXML projects so that you can benefit from proven implementation techniques, best practices, tips and tricks, and performance guidelines that are not described elsewhere.
WHO SHOULD READ THIS BOOK?
This book is written for database administrators, application developers, IT architects, and everyone who wants to get a deep technical understanding of DB2’s pureXML technology and how to use it most effectively. As a DBA you will learn, for example, how to design and manage XML storage objects, how to index XML data, where to find XML-related information in the DB2 catalog, and how to manage XML with DB2 utilities. Application developers learn, among other things, how to write XML queries and XML updates with XPath, SQL/XML, and XQuery, and how to code XML applications with Java, .NET, C, COBOL, PL/1, PHP, or Perl.
This book is suitable for both beginners and experts. Each topic starts with simple examples, which provide an easy introduction, and works towards advanced concepts and solutions to complex problems. Extensive XML knowledge is not required to read this book because it includes the necessary introductions to XML, XPath, XQuery, XML Schema, and namespaces. These concepts are explained through numerous examples that are easy to follow. We assume that you have some experience with relational databases and SQL, but we show all the relevant DB2 commands that are required to work through the examples in this book. Appendix C, Further Reading, also contains links to additional educational material about both DB2 and XML.
COVERAGE OF DB2 FOR Z/OS AND DB2 FOR LINUX, UNIX, AND WINDOWS IN THIS BOOK
The book describes DB2 pureXML on all supported platforms and versions, which at the time of writing are DB2 9 for z/OS as well as DB2 9.1, 9.5, and 9.7 for Linux, UNIX, and Windows. Many pureXML features and functions are identical across DB2 for Linux, UNIX, and Windows and DB2 for z/OS. Where platform-specific differences exist we point them out along the way. However, this book does not intend to be a reference that lists all functions and features according to platform and version of DB2. Instead, this book is a “cookbook” that focuses on concepts, examples, and best practices.
The capabilities in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to grow and converge over time. For the latest information on which feature is available in which version, please consult the respective DB2 information center. DB2 for z/OS also continues to deliver pureXML enhancements via APARs. Please look at APAR II14426, which is an informational APAR that summarizes and links all other XML-related APARs for DB2 on z/OS.
In our work with users who adopt DB2 pureXML we have made the following observation: Some of the users who begin to use DB2 pureXML on Linux, UNIX, and Windows have little or no prior experience with DB2. In contrast, most users who are interested in DB2 pureXML on z/OS are already familiar with DB2 for z/OS in general.
This difference is reflected in this book; that is, we describe some DB2 concepts, such as monitoring or the use of DB2 utilities, in more detail for DB2 for Linux, UNIX, and Windows than for DB2 for z/OS.

DO IT YOURSELF!

The best way to learn a new technology is hands-on. We strongly recommend that you download DB2 Express-C, which is free, and try the concepts that you learn in this book in DB2’s sample database. Appendixes A and B contain the necessary information to get you started.

DON’T HESITATE TO ASK QUESTIONS!

If any pureXML question is not covered in this book, the fastest way to get an answer is to post a question in the DB2 pureXML forum at http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423. Whether you seek clarification about specific features or functions, or need help with a tricky query, this forum is the right place to ask for help. You are also welcome to contact the
authors directly. If you want to discuss an XML project or if you have comments or feedback on the material in this book, we will be happy to hear from you. Please contact Matthias at mnicola@us.ibm.com and Pav at kumarp2@uk.ibm.com.

HOW THIS BOOK IS STRUCTURED

The DB2 pureXML Cookbook takes you through the different tasks and topics that you typically encounter during the life cycle of an XML project. The 23 chapters of this book are structured as follows:

Planning

Chapter 1, Introduction, provides an overview of XML and how it differs from relational data, and discusses scenarios where XML has advantages over the relational model. This chapter also includes a summary of the pureXML technology.

Chapter 2, Designing XML Data and Applications, covers fundamental XML design questions such as choosing between XML elements and attributes, selecting an appropriate XML document granularity, and deciding on a “good” mix of XML and relational data for your application.

Designing and Populating an XML Database

Chapter 3, Designing and Managing XML Storage Objects, first explains the tree representation of XML documents and how they are physically stored in DB2. Then it describes how to create and manage tables and table spaces for XML, including compression, reorganization, and partitioning.

Chapter 4, Inserting and Retrieving XML Data, looks at “full document” operations such as insert, delete, and retrieval of XML documents. This chapter also explains how to handle XML declarations, white space, and reserved characters in XML documents.

Chapter 5, Moving XML Data, looks at importing, exporting, loading, replicating, and federating XML data in DB2. A technique to split large XML documents into smaller ones is also demonstrated.

Querying XML Data

Chapter 6, Querying XML Data: Introduction and XPath, is the first of four chapters on querying XML data.
This chapter provides an overview of the different options for querying XML, introduces the XPath and XQuery data model, and describes the XPath language in detail. These concepts are fundamental for the subsequent chapters.
Chapter 7, Querying XML Data with SQL/XML, explains how XPath can be included in SQL statements with the SQL/XML functions XMLQUERY and XMLTABLE and the XMLEXISTS predicate. The use of SQL/XML is illustrated through a rich collection of examples and a discussion of common mistakes and how to avoid them.

Chapter 8, Querying XML Data with XQuery, introduces the XQuery language, which is a superset of XPath. Among other things, this chapter describes XQuery FLWOR expressions and combinations of SQL and XQuery, and compares XPath, XQuery, and SQL/XML.

Chapter 9, Querying XML Data: Advanced XML Queries and Troubleshooting, takes querying XML data to the expert level. It demonstrates how to perform grouping, aggregation, and joins over XML data or a mix of XML and relational data. The troubleshooting section discusses “bad” XML queries, common errors, and how to avoid both.

Converting, Updating, and Transforming

Chapter 10, Producing XML from Relational Data, begins the discussion of converting, updating, and transforming data. This chapter explains how to read relational data from existing database tables and construct XML documents from it.

Chapter 11, Converting XML to Relational Data, describes the opposite of Chapter 10, that is, the process of decomposing or shredding XML documents into relational tables. Two shredding methods are discussed, one using the XMLTABLE function and the other using annotated XML Schemas.

Chapter 12, Updating and Transforming XML Documents, covers three techniques for updating XML documents: full document replacement, XSLT transformations, and the XQuery Update Facility, which allows you to modify, insert, delete, or rename individual elements and attributes within an XML document.

Performance and Monitoring

Chapter 13, Defining and Using XML Indexes, is one of two chapters dedicated to performance.
It describes how to create XML indexes to improve query performance and explains under which conditions query predicates can or cannot use XML indexes. Chapter 14, Performance and Monitoring, looks at analyzing the performance of XML operations with particular emphasis on understanding XML query access plans. A summary of best practices for XML performance in DB2 is also provided.
Ensuring Data Quality

Chapter 15, Managing XML Data with Namespaces, introduces XML namespaces and explains how they avoid naming conflicts and ambiguity, thus contributing to data quality. This chapter illustrates how to index, query, update, and construct XML documents that contain namespaces.

Chapter 16, Managing XML Schemas, first describes how XML Schemas can constrain XML documents in terms of their structure, element and attribute names, data types, and other characteristics. Then this chapter walks you through the concepts of registering, managing, and evolving XML Schemas in DB2.

Chapter 17, Validating XML Documents against XML Schemas, concentrates on the validation of XML documents to ensure XML data quality in DB2. You can validate XML documents in INSERT and UPDATE statements, queries, and import and load operations.

Application Development

Chapter 18, Using XML in Stored Procedures, UDFs, and Triggers, demonstrates how you can implement application-specific processing logic with XML manipulation in SQL stored procedures, user-defined functions, and triggers.

Chapter 19, Performing Full-Text Search, describes how the DB2 Net Search Extender and DB2 Text Search support efficient full-text search in collections of XML documents.

Chapter 20, Understanding XML Data Encoding, explains internal and external XML encoding, how DB2 determines and handles XML encoding, and how you can avoid code page conversion.

Chapter 21, Developing XML Applications with DB2, contains techniques and best practices for application programs that exchange XML data with the DB2 server. Code samples are provided for Java, .NET, C, COBOL, PL/1, PHP, and Perl programmers.

Reference Material

Chapter 22, Exploring XML Information in the DB2 Catalog, is a guide to how XML storage objects, XML indexes, and XML Schemas are listed in the database catalog.

Chapter 23, Test Your Knowledge: The DB2 pureXML Quiz, offers 82 questions to revisit specific topic areas.
The Appendixes list supporting information and further reading for each chapter.
Acknowledgments

Writing this book would not have been possible without the support from many people. For their support and technical reviews we would like to thank Andrew Eisenberg, Andy Lai, Bert van der Linden, Bob Harbus, Christian Daser, Cindy Saracco, Craig Mullins, Daniela Wersin, David Salinero, Don Chamberlin, Guogen Zhang, Henrik Loeser, Holger Seubert, Ian Cook, Jan-Eike Michels, Jason Cu, John Pickford, Lan Huang, Manfred Paessler, Mark Mezofenyi, Martin Sommerlandt, Paul Fletcher, Phil Nelson, Qi Jin, Shantanu Munkur, Stefan Momma, Susan Gausden, Susan Malaika, Susan Visser, Susanne Englert, Thomas Fanghaenel, Tiffany Money, Tim Kiefer, and Yuchu Tong.

Thanks also to the many talented people in the DB2 pureXML development team who have implemented this exciting technology that we have the privilege of writing about.
About the Authors

Matthias Nicola is a Senior Software Engineer for DB2 pureXML at IBM’s Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, XML storage, indexing, and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Matthias has published more than a dozen articles on various XML topics (see www.matthiasnicola.de) and is a frequent speaker at DB2 conferences. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.

Pav Kumar-Chatterjee has worked with DB2 since 1991 on DB2 for z/OS and since 2000 on DB2 for Linux, UNIX, and Windows. He is currently employed by IBM as a technical sales specialist for Information Management in the United Kingdom. He has helped customers implement the XML Extender product with DB2 V8 and has presented on DB2 and XML in the United Kingdom and around Europe.
CHAPTER 1

Introduction

XML, the eXtensible Markup Language, is the standard format for exchanging information between different systems, applications, and organizations. XML is also the underlying data format for many web applications, Service-Oriented Architectures (SOA), and message-based transaction processing systems. Enterprise application integration (EAI), enterprise information integration (EII), web services, the enterprise service bus (ESB), and standardization efforts in many vertical industries all rely on XML as the underlying technology for data exchange.

Organizations as well as entire industries have standardized XML Schemas to promote and simplify data exchange and are evolving those schemas to meet changing business needs. Many industry-specific initiatives as well as regulatory requirements are driving the adoption of XML.

As more business transactions are conducted through web-based interfaces and electronic forms, government agencies and commercial enterprises face increasing requirements for preserving and post-processing the original transaction records. XML provides a straightforward means of capturing and maintaining the data associated with such electronic transactions.

XML uses tags to define elements and attributes that hold business data. The element and attribute tags describe the intended meaning of the data items, and the nesting of the tags describes hierarchical relationships between the data items. Hence, XML is a self-describing data format. Data and metadata are tightly integrated in a vendor- and platform-independent format. These properties make XML well-suited for data exchange. Additionally, new tags can be invented and easily added. This extensibility allows XML to accommodate ever-evolving business needs.

XML is a flexible data model that is suited for any combination of structured, unstructured, and semi-structured data. Also, XML documents can be modified and transformed, even into other
formats such as HTML. Furthermore, the consistency of XML documents can easily be verified with an XML Schema. All this has become possible through widely available standards and tools such as XML parsers, XSLT, XPath, XQuery, and XML Schema. They greatly relieve applications from the burden of dealing with proprietary data formats. In an era where message formats, business forms, processes, and services change frequently, XML often reduces the cost and time it takes to react to such changes and to maintain databases and application logic correspondingly.

Beyond XML for data exchange, enterprises are keeping large amounts of business-critical data permanently in XML format. There are several reasons for this practice. Some businesses must retain XML documents in their original format for auditing and regulatory compliance. Common examples include legal and financial documents as well as electronic forms. Another reason for using XML as a permanent storage format is that XML can be a more suitable data model than a relational schema. If business objects are inherently complex, hierarchical, semi-structured, or highly variable in nature, the flexibility of XML offers advantages over a rigorously defined relational database schema.

Accustomed to the benefits of mature relational databases, many users expect the same capabilities for XML data, such as the ability to persist, query, index, update, and validate XML data with full ACID (Atomicity, Consistency, Isolation, Durability) compliance, recoverability, high availability, and high performance. DB2 pureXML is the answer.
The subsequent discussion in this chapter is structured along the following topics:

• Brief introduction to XML as a data format (section 1.1)
• Differences between XML and relational data (section 1.2)
• Overview of DB2 pureXML and its capabilities for managing XML data (section 1.3)
• Advantages of DB2 pureXML over alternative storage options for XML (section 1.4)
• Sample scenarios where XML can offer advantages over relational data (section 1.5)

1.1 ANATOMY OF AN XML DOCUMENT

In this section we illustrate the most important parts of an XML document. A complete and exhaustive discussion of the XML standard is outside the scope of this book. Pointers to textbooks and tutorials about XML are provided in Appendix C, Further Reading.

Let’s look at the XML document in Figure 1.1 as an example. The first line of the document contains the optional XML declaration. It indicates that this document follows the XML 1.0 standard, which is most commonly used. Besides XML 1.0, the only other version of XML is currently XML 1.1, which is very rarely used. We only consider XML 1.0 in this book. The XML declaration of the sample document in Figure 1.1 also carries an optional encoding declaration. Encoding concepts are discussed in Chapter 20, Understanding XML Data Encoding.
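As a hands-on companion to the walkthrough of Figure 1.1 in this section, the following minimal sketch reads the same document with Python's standard xml.etree.ElementTree module. It is only an illustration of the document parts (root element, namespace, attributes, repeating elements, well-formedness) and says nothing about how DB2 processes XML. The prefix "c" and the helper is_well_formed are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# The document from Figure 1.1, held as bytes so that the parser can
# honor the encoding declaration in the first line.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<customerinfo xmlns="http://posample.org" Cid="1005">
  <name>Larry Menard</name>
  <addr country="Canada">
    <street>223 NatureValley Road</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M4C 5K8</pcode-zip>
  </addr>
  <phone type="work">905-555-9146</phone>
  <phone type="home">416-555-6121</phone>
  <!-- this is comment -->
</customerinfo>"""

# Map a prefix of our own choosing to the namespace declared by xmlns.
NS = {"c": "http://posample.org"}

root = ET.fromstring(DOC)  # raises ET.ParseError if the text is not well-formed

# ElementTree reports element names in the form {namespace}localname.
root_tag = root.tag
cid = root.get("Cid")  # attribute of the root element

name = root.find("c:name", NS).text
country = root.find("c:addr", NS).get("country")

# phone is a repeating element: findall() returns every occurrence
# in document order.
phones = [(p.get("type"), p.text) for p in root.findall("c:phone", NS)]


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(root_tag, cid, name, country, phones)
print(is_well_formed("<name>Larry Menard</name>"))  # matching start and end tag
print(is_well_formed("<name>Larry Menard</Name>"))  # tag names are case sensitive
```

The last call fails the well-formedness check because the end tag Name does not match the start tag name.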
An XML document consists of elements and their attributes. Each element consists of a start tag and an end tag. These tags are enclosed in angle brackets. For example, the third line of the document shows a start tag <name> and an end tag </name>. Together they define a single XML element, the name element. The characters between the start and the end tag, Larry Menard, represent the value or the content of this element. Every start tag of an element must have a corresponding end tag.

Elements can contain other elements, which means that tags can be nested. For example, the element addr contains the elements street, city, prov-state, and pcode-zip. Nesting builds hierarchical structures and expresses relationships between the elements. Elements can occur multiple times, in which case they are called repeating elements. For example, the phone element is a repeating element. It occurs multiple times because a single customer can have multiple phone numbers. Nested and repeating elements express one-to-many relationships between data items.

<?xml version="1.0" encoding="UTF-8" ?>
<customerinfo xmlns="http://posample.org" Cid="1005">
  <name>Larry Menard</name>
  <addr country="Canada">
    <street>223 NatureValley Road</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M4C 5K8</pcode-zip>
  </addr>
  <phone type="work">905-555-9146</phone>
  <phone type="home">416-555-6121</phone>
  <!-- this is comment -->
</customerinfo>

Callouts in the figure: XML and encoding declaration; start tag of the root element; namespace declaration; attribute; element; element value (text node); attribute name; attribute value; comment; end tag of the root element.

Figure 1.1 Anatomy of an XML document

Elements can also contain one or multiple attributes within their start tag. Attributes are used to attach additional information to elements. They consist of an attribute name, the equal sign (=), and a value in quotes.
For example, the element addr has an attribute country whose value is
Canada. Similarly, each occurrence of the element phone has an attribute type. Attribute values must be in quotes regardless of whether the value is considered a numeric or a string value.

For an XML document to be well-formed, it must have a single root element. The root element is the outermost element and contains all the other elements of the document. The root element in Figure 1.1 is customerinfo. It contains two attributes in its start tag, xmlns and Cid. The attribute Cid is used here to represent the customer identification number. The attribute xmlns is a reserved attribute and declares a namespace. Namespaces are optional and we defer their discussion to Chapter 15, Managing XML Data with Namespaces.

XML element and attribute names are case sensitive. The tags <name>, <Name>, and <NAME> are all completely distinct from each other. XML element and attribute names can contain letters, numbers, and certain other characters such as the underscore. However, tag names must not start with a number or punctuation character, must not start with the characters xml (or XML, xML, and so on), and must not contain spaces.

The order in which elements appear in a document is significant. The order in which attributes appear within the start tag of an element is not significant. In other words, elements are ordered, attributes are not ordered.

When to use elements and when to use attributes to represent certain data items is a data modeling question and is addressed in Section 2.1, Choosing Between XML Elements and XML Attributes. Further discussion of XML documents and their hierarchical representation is provided in Section 3.1, Understanding XML Document Trees.

1.2 DIFFERENCES BETWEEN XML AND RELATIONAL DATA

For a comparison of XML and relational data, let’s consider the simple XML document and the relational table in Figure 1.2. The relational table has six columns with fixed names and data types.
This table is a very strict and inflexible structure because every row in the table has to have exactly the same format with the same number of columns and the same data types. It is not possible for one row in the table to have more or fewer columns than the next. It is also not possible for a column to have no data type or more than one data type. Each column has to have exactly one fixed data type. Moreover, the structure and data types of the table are defined before any data is inserted. Whenever data is inserted into or retrieved from this table, the format of the rows is known without looking at the actual data. The strict schema provides a lot of information about the data and its format, which allows for very efficient access.

The XML document on the left side of Figure 1.2 represents data similar to the row in the table on the right. With DB2 pureXML you can store, index, query, and update this XML document even if there is no XML Schema that defines its structure or the data types of its elements. You may have an XML Schema for this XML document, but you don’t have to. The document itself contains some meta information that describes the data items, but no further schema information is necessary to store and query this document.
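The contrast between a fixed row format and self-describing documents can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for a relational database (not DB2), and Python's xml.etree.ElementTree for the XML side. SQLite does not enforce the declared column types or lengths, so only the fixed number of columns is demonstrated. The second customer (Cid 1004, "Jane Doe") and the helper child_names are invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational side: the table from Figure 1.2 in an in-memory SQLite
# database (a stand-in for illustration, not DB2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), "
    "city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))"
)
conn.execute(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)",
    (1003, "Robert Shoemaker", "845 Kean Street", "Aurora", None, "905-555-7258"),
)

# The row format is known from the catalog, without reading any row.
columns = [row[1] for row in conn.execute("PRAGMA table_info(address)")]

# A made-up row with a seventh value does not fit the fixed format.
try:
    conn.execute(
        "INSERT INTO address VALUES (?, ?, ?, ?, ?, ?, ?)",
        (1004, "Jane Doe", "1 Main Street", "Aurora", None, "555-0100", "extra"),
    )
    extra_value_accepted = True
except sqlite3.Error:
    extra_value_accepted = False

# XML side: the document from Figure 1.2 and a made-up second document
# with a different set of child elements. No schema is involved.
DOC_A = """<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  <phone>905-555-7258</phone>
</customerinfo>"""

DOC_B = """<customerinfo Cid="1004">
  <name>Jane Doe</name>
  <email>jane@example.org</email>
</customerinfo>"""


def child_names(text):
    """Return the names of the root's child elements in document order."""
    return [child.tag for child in ET.fromstring(text)]


shape_a = child_names(DOC_A)
shape_b = child_names(DOC_B)

print(columns, extra_value_accepted)
print(shape_a, shape_b)
```

The table rejects the row that does not match its six columns, while the two documents are both accepted and their structure is discovered only by looking at the data itself.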
<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr>
    <street>845 Kean Street</street>
    <city>Aurora</city>
  </addr>
  <phone>905-555-7258</phone>
</customerinfo>

CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258

Figure 1.2 XML document (left) and relational table (right)

Assume you receive information about another customer whose street name is 42 characters long. Inserting this information into the relational table fails with an error that needs to be handled. This error can be desirable because it enforces a certain constraint, but it can also be undesirable because it prevents the new information from being stored and processed immediately. Because XML allows more schema flexibility, a document with a 42-character street name can be inserted without an error. The absence of an error can be desirable because it allows the data to be stored immediately, but it can also be undesirable because the excessive length of the street value goes undetected and can cause problems in later processing steps. Clearly, the flexibility of XML needs to be used with care and only to the degree that is appropriate for a given application.

Optionally, you can choose to use an XML Schema that constrains the XML document as strictly as the relational table in Figure 1.2. You could also choose to use a less stringent XML Schema. For example, you could use an XML Schema that requires the Cid value to be an integer and the name to not exceed 30 characters, leaving the data types of all other data items unconstrained. You can choose the degree of schema flexibility that is right for your application.

Note that the relational table in Figure 1.2 contains a NULL value in the column email.
In the XML document, an email element is simply omitted if this customer does not have email. Optional XML elements are another form of schema flexibility. Assume you receive information about a customer where, unexpectedly, the name of his assistant is included. The assistant name can easily be accommodated with an optional assistant element in an XML document. However, the relational table in Figure 1.2 does not allow the assistant name to be stored.

Next, let’s consider a schema change. Due to unforeseen changes in your business, you now need to store multiple phone numbers per customer. Reacting to this change is simple with XML. The document on the left side of Figure 1.3 simply uses multiple occurrences of the phone element. The repeating phone elements represent the new one-to-many relationship between customers and phones. Existing XPath queries that read phone elements do not change. Accommodating
multiple phone numbers per customer in the relational schema requires normalization, which is a drastic schema change. Existing SQL queries must be modified to perform the proper join between the two relational tables. Downtime and service interruptions are likely.

<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr>
    <street>845 Kean Street</street>
    <city>Aurora</city>
  </addr>
  <phone>905-555-7258</phone>
  <phone>416-555-2937</phone>
</customerinfo>

CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258

CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))

CID   PHONE
1003  905-555-7258
1003  416-555-2937

Figure 1.3 A schema change in XML and relational data

Some of the key differences between XML and relational data are summarized in Table 1.1. The flexibility of XML implies that examining and interpreting XML data can consume more computing resources than if the same data were stored in relational form. The reason is that information about the structure of the XML data needs to be discovered at runtime because a fixed schema is not always present.

The relational data model relies on much more rigid schema definitions than XML. For a relational table in a database, the structure of a row and the size and data types of its columns are known as soon as the table is created. Therefore, data access is more straightforward and can be more efficient than for XML data. As such, relational data can provide very high performance but might fail to meet application requirements for schema flexibility.
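The schema change of Figure 1.3 can be replayed in a small sketch. It assumes Python's xml.etree.ElementTree for the XML side and an in-memory SQLite database as a stand-in for the relational side (not DB2); the function name phones_xml and the exact SQL statements are illustrative choices, and the tables are the ones shown in Figure 1.3.

```python
import sqlite3
import xml.etree.ElementTree as ET

# XML side: the customer document before and after the change.
BEFORE = """<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  <phone>905-555-7258</phone>
</customerinfo>"""

AFTER = """<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  <phone>905-555-7258</phone>
  <phone>416-555-2937</phone>
</customerinfo>"""


def phones_xml(text):
    """Read all phone elements; the path expression is the same for both documents."""
    return [p.text for p in ET.fromstring(text).findall("phone")]


# Relational side: the two tables from Figure 1.3 in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), "
    "city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))"
)
conn.execute("CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))")
conn.execute(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)",
    (1003, "Robert Shoemaker", "845 Kean Street", "Aurora", None, "905-555-7258"),
)
conn.executemany(
    "INSERT INTO phones VALUES (?, ?)",
    [(1003, "905-555-7258"), (1003, "416-555-2937")],
)

# The query written for the original single-table design still runs,
# but it no longer sees every phone number.
old_rows = conn.execute("SELECT phone FROM address WHERE cid = ?", (1003,)).fetchall()

# The rewritten query joins the two tables.
new_rows = conn.execute(
    "SELECT a.name, p.phone FROM address a JOIN phones p ON p.cid = a.cid "
    "WHERE a.cid = ? ORDER BY p.phone",
    (1003,),
).fetchall()

print(phones_xml(BEFORE), phones_xml(AFTER))
print(old_rows, new_rows)
```

The XML reader is untouched by the change, whereas the relational reader has to be rewritten as a join to return both numbers.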
Table 1.1 Comparison of Relational and XML Data

Relational Data | XML Data
Highly structured, highly regular in nature | Semi-structured, can be highly variable in nature
Rows are flat | Data is hierarchical, can be arbitrarily nested
Fixed schema and metadata | Variable schema and metadata
Fixed number of columns per table | No fixed format, flexible number of elements and attributes per document
Fixed data type for all values in a column | Data types are optional and can be variable
Data format defined by DDL, known at query/insert/update compile time | Data format not necessarily predefined, not known until query/insert/update runtime
NULL values represent missing information | Optional elements and attributes can be omitted
Schema changes can be expensive | Schema changes are less expensive

In some cases, the nested and flexible structure of XML can offer performance benefits over relational schemas. Relational databases often require normalization to fit business data into flat, tabular structures. This normalization of complex business data requires transformation when data is stored and retrieved, and often leads to multi-way join queries in relational databases. XML can provide a more natural representation of complex business objects with all relevant relationships represented in a single document. The hierarchies within an XML document are essentially precomputed joins between related data items.

1.3 OVERVIEW OF DB2 PUREXML

This section provides a condensed overview of the DB2 pureXML technology. It summarizes the most important aspects of DB2 pureXML, which are described in more detail in the remainder of this book.

At the core of DB2 pureXML is the data type XML, which has been added to the SQL type system in the SQL:2003 standard. Database users can define tables that contain one or multiple columns of type XML. In each row, a column of type XML contains either a well-formed XML document or NULL. A table that contains one or more XML columns can also contain other columns, such as INTEGER, VARCHAR, or DATE columns. Hence, users can define tables that hold both XML data and traditional relational data in each row of the table. The integration of XML and relational data is therefore very easy. It is also possible to create a table that only contains a single column of type XML and no other columns.
DB2’s internal XML storage mechanism does not store XML data as text in large objects (LOBs) and does not convert XML to relational format. When you insert or load XML documents into a column of type XML, DB2 stores the XML documents in a parsed hierarchical format. Each XML document is parsed only once; that is, when it is first inserted into an XML column. The parsed storage format allows queries and updates to operate on XML data without XML parsing, which is a key performance benefit. The maximum XML document size is 2 GB.

You can use regular SQL statements to insert, delete, and update (replace) full XML documents. XML insert, update, and delete operations are logged by default and XML data is always buffered in the buffer pool. XML data participates in backup, restore, and recovery operations just like traditional relational data in the database. XML data can be compressed, replicated, and
federated, and is allowed in range-partitioned tables, clustered tables (MDC), and partitioned database environments (DPF). Partitioning keys and clustering keys must be relational columns. All the critical database utilities support XML data, such as LOAD, UNLOAD, IMPORT, EXPORT, RUNSTATS, REORG, BACKUP, RESTORE, and others. In DB2 for Linux, UNIX, and Windows, XML columns are also supported by High Availability Disaster Recovery (HADR).

An XML Schema can be used to constrain XML documents, but the usage of XML Schemas is optional in DB2. In particular, you do not need to provide an XML Schema to create a column of type XML or to insert XML documents. DB2’s pureXML storage format does not depend on XML Schemas. When you insert, update, or load XML documents, you can choose to validate the documents against one or multiple XML Schemas. If you choose to validate documents, the validation and the association of schemas to documents happens on a per-document basis, not on a per-column basis. DB2 does not require all documents in an XML column to belong to the same XML Schema, although you can enforce that with triggers if you want. Since schema flexibility is often a key reason for using XML, DB2 allows documents for multiple schemas, or multiple versions of a schema, to coexist in a single XML column. XML Schema evolution is seamless and does not require any database downtime.

The use of XML Schemas for document validation can help applications ensure XML data quality. However, there is no performance penalty if you store XML documents without validation in DB2. Although XML Schemas can constrain one XML document at a time, there is no standard or XML technology yet to define constraints or referential integrity across XML documents or across XML and relational data. However, when you insert XML documents into a table you can choose to extract selected element or attribute values into relational columns.
DB2 can perform such value extraction as part of the INSERT statement, but it can also be automated with triggers. Then you can define relational constraints, such as foreign keys and check constraints, on the populated relational columns.

In DB2, XML data can be queried with XPath and SQL/XML, and in DB2 for Linux, UNIX, and Windows, also with XQuery. The SQL/XML standard allows XPath and XQuery expressions to be embedded in SQL statements so that XML and relational data can be queried together in a single query. Joins between XML columns or between XML and relational columns are possible. The SQL/XML function XMLTABLE can be used to query XML data and return the result set in relational format. Other SQL/XML functions support the opposite direction; that is, they query traditional relational tables and construct and return XML documents that contain the data values.

To ensure high performance for XML queries, DB2 allows you to create XML indexes on specific XML elements and attributes that you specify with an XPath expression. Similar to the relational world, it makes sense to index those XML elements and attributes that are frequently used in query predicates and join conditions. Although you can decide to index all elements and all attributes in all documents in an XML column, you are not forced to do so. Indexing selected elements and attributes is often preferred. If you define an XML index on an optional element that, for example, occurs in only 5% of the documents (rows), then the index is quite small because it contains
entries only for those 5% of the documents and rows in the table. In contrast, relational indexes always contain exactly one entry for each row in a table. If a query contains relational predicates and XML predicates, DB2 can use a combination of XML and relational indexes to evaluate the query.

DB2’s RUNSTATS utility can collect statistics for XML data which the DB2 optimizer uses to create efficient query execution plans. Although DB2 uses separate storage formats for XML and relational data, DB2 only has a single processing engine and a single query compiler and optimizer that handle any mix of relational and XML queries. DB2’s EXPLAIN facility can be used to examine the execution plans for XML queries just like for relational queries.

DB2 for Linux, UNIX, and Windows also supports XQuery Updates to modify, insert, delete, or rename individual XML elements and attributes within an XML document. XSLT transformations as well as full-text search over XML data are also supported.

Access control as well as concurrency control (locking) for XML data happens on the level of full documents. Since each XML document belongs to a row in a table, access control and concurrency control for a particular row determines the accessibility of the XML document in that row. Access rights and privileges cannot be defined for individual elements within an XML document.

The XML data type can be used for more than just the definition of XML columns. For example, you can define XML parameters and XML variables in SQL stored procedures and user-defined functions (UDFs). Such procedures and UDFs can contain XQuery or SQL/XML statements to manipulate XML documents while they remain in DB2’s internal parsed format.

Application development for DB2 pureXML is based on existing but enhanced APIs.
The traditional database APIs such as JDBC, ODBC/CLI, ADO.NET, or embedded SQL all support XQuery and SQL/XML statements as well as the exchange of XML data between a DB2 server and a client application. The JDBC 4.0 standard defines a new Java data type SQLXML to match the data type XML defined by the SQL standard. Similarly you can define XML host variables in COBOL, C, PL/1, and Assembler. With DB2 pureXML, applications can often avoid XML parsing, because DB2 stores XML documents in a parsed format. The parsed storage allows you to extract or update document fragments or individual values without having to parse the XML data in your application. Applications send appropriate XML query or update statements to DB2 instead of fetching and parsing full documents. As a result, using DB2 pureXML leads to less application code, reduced application complexity, and higher end-to-end performance. Both the DB2 Control Center and IBM Data Studio support DB2 pureXML through a variety of wizards and visual interfaces. For example, you can view the tree structure of XML documents, create XML indexes with point-and-click into XML documents, design and register XML Schemas, or build XQuery and SQL/XML statements with context assist in Data Studio’s statement editor.
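The earlier point about XML indexes on optional elements can be made concrete with a toy model. The following Python sketch is only an illustration of the idea that an index on an optional element has entries solely for the documents that contain it. It is not how DB2 implements XML indexes, and the product documents, the element names, and the build_index helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: 20 product documents (one per row);
# only row 7 (5% of the rows) has the optional <voltage> element.
docs = {}
for row_id in range(1, 21):
    voltage = "<voltage>230</voltage>" if row_id == 7 else ""
    docs[row_id] = f"<product><name>P{row_id}</name>{voltage}</product>"


def build_index(documents, path):
    """Toy index: map each value found at `path` to the row ids containing it."""
    index = {}
    for row_id, text in documents.items():
        for node in ET.fromstring(text).findall(path):
            index.setdefault(node.text, []).append(row_id)
    return index


name_index = build_index(docs, "./name")        # element present in every document
voltage_index = build_index(docs, "./voltage")  # optional, rarely present element

# Count index entries (value/row pairs) in each toy index.
name_entries = sum(len(rows) for rows in name_index.values())
voltage_entries = sum(len(rows) for rows in voltage_index.values())
print(name_entries, voltage_entries)
```

In this toy model the index on the mandatory element has one entry per row, while the index on the optional element has an entry only for the single row that contains it.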
1.4 BENEFITS OF DB2 PUREXML OVER ALTERNATIVE STORAGE OPTIONS FOR XML DATA

Prior to the availability of DB2 pureXML, the two main storage options for XML data in relational databases were LOB storage and shredding:

• The LOB storage approach stores full XML documents in their textual form in character or binary large object columns (CLOB or BLOB). Other columns in the same table typically contain document identification numbers or other information that helps applications to identify specific XML documents for retrieval or replacement. The main problem of this approach is that the XML documents are stored as if they were arbitrary pieces of text. The XML structure is ignored and not immediately visible. Therefore any operation that needs to access individual elements or attributes in a document requires XML parsing. For example, any query that extracts element values requires XML parsing at runtime. The resulting parsing overhead for query and update execution is a major performance problem that renders LOB storage inadequate for most XML applications.

• Shredding (decomposing) XML documents into relational tables converts XML data into relational format. Shredding first requires a design stage where an administrator maps XML elements and attributes to relational columns. When XML documents are inserted, they are parsed, broken up, and only their atomic data values are retained (see Figure 1.4). These values are inserted into the relational target tables by a series of INSERT statements. After an XML document has been shredded, its values are stored in these tables without the original XML tags. Depending on the complexity of the XML documents, shredding can require dozens or hundreds of relational tables to represent all the hierarchical relationships among the original XML elements and attributes.
In many real-world XML applications this complexity is staggering such that even the mapping task is considered prohibitively expensive or unfeasible. Queries over decomposed XML data often require multi-way SQL joins that tend to be difficult to develop and tune. Changes or variability in the XML input format often break the mapping to the relational database schema, which incurs time-consuming maintenance. A fixed schema mapping that is costly to change negates the flexibility for which XML is typically used. DB2 pureXML has been designed to overcome the problems that are inherent in LOB storage and shredding. The advantages of DB2 pureXML and its native XML storage format include: • Retaining awareness of the internal structure of the XML data: Contrary to LOB storage, DB2 pureXML stores XML in a parsed tree format that explicitly represents the structure of each XML document. As a result, applications can query and update XML data using XQuery, XPath, and SQL/XML without XML parsing at runtime. This is a critical performance benefit. Additionally, query performance can be enhanced by creating indexes on specific elements and attributes in the XML documents.
Figure 1.4 DB2 pureXML and alternative XML storage options (panels: LOB storage, which stores XML as text in a CLOB column; shredding, which maps XML documents through a shredder into regular relational tables; DB2 pureXML, which stores XML as XML in an XML column with an XML index)

• Keeping business objects intact: DB2 pureXML stores each XML document as a cohesive unit that belongs to one row in a table, providing a very intuitive storage and processing model for the application developer. In contrast, XML shredding scatters the values of each XML document over a number of tables. Hence, shredding can result in an unwieldy relational schema that is difficult to understand and inefficient for queries and the reconstruction of XML documents.

• Schema flexibility: While shredding requires all XML documents to adhere to a single XML Schema that is mapped to relational tables, DB2 pureXML can store documents for variable or evolving schemas in the same XML column. The cost of schema evolution is much lower for DB2 pureXML than for a shredding approach.

• Faster application development: Because DB2 pureXML does not require any schema mapping and uses a single XML column instead of a complex relational schema, prototyping and designing applications can be much simpler with DB2 pureXML than with shredding.

1.5 XML SOLUTIONS TO RELATIONAL DATA MODEL PROBLEMS

The data model that you use for your business data should allow for an easy and intuitive representation of your data and should efficiently support the most critical usage and access patterns. If the data being modeled is naturally tabular, it is typically better to represent it in relational format than as XML. However, there are cases where the relational model is not necessarily the best choice and sometimes even a poor choice to hold your data. The following are some situations where an XML representation tends to be more beneficial than the relational format.
1.5.1 When the Schema Is Volatile

Problem with relational data: If the schema of the data changes often, then a relational representation of the data is subject to costly relational schema changes. Although some forms of schema modification are relatively painless in relational databases, such as adding a new column to a table, other forms are more involved, such as dropping a column or changing the type of a column. Still other forms of schema modification are extremely difficult, such as normalizing one table into multiple tables. Changing the tables means that the SQL statements in the applications that access them must also be changed.

Solution with XML data: Portions of the schema that are volatile can be expressed as a single XML column. The self-describing and extensible nature of XML allows seamless handling of schema variability and evolution. Changes in the XML document format are accommodated without changing tables or columns in the database and typically without breaking existing XML queries.

1.5.2 When Data Is Inherently Hierarchical in Nature

Problem with relational data: Data that is inherently hierarchical or recursive is often difficult to represent in relational schemas. Examples include a bill of materials, engineering objects, or biological data. A bill of materials explosion can be stored in a relational database but reconstructing it in parts or in full might require recursive SQL.

Solution with XML data: Since XML is a hierarchical data model, it is a much more natural fit for inherently hierarchical business data. Using XML allows simple, navigational data access to replace complex set operations, which would be required if the same data were represented in tabular format.
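As a rough illustration of the navigational access mentioned above, the following Python sketch walks a small bill of materials with path expressions. The document, its element names (assembly, part), and its attribute names are invented for this example, and the standard-library ElementTree module supports only a subset of XPath, so this is a sketch of the idea rather than of DB2 query processing.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill of materials; all names are made up for illustration.
bom_xml = """
<assembly name="bicycle">
  <assembly name="wheel">
    <part name="rim"/>
    <assembly name="hub">
      <part name="axle"/>
      <part name="bearing"/>
    </assembly>
  </assembly>
  <part name="frame"/>
</assembly>
"""
bom = ET.fromstring(bom_xml)

# A single descendant path returns every part at any nesting depth,
# with no recursive query logic in the application.
all_parts = [p.get("name") for p in bom.findall(".//part")]

# Restricting the navigation to one subassembly is just a longer path.
wheel_parts = [p.get("name")
               for p in bom.findall("./assembly[@name='wheel']//part")]

print(all_parts)
print(wheel_parts)
```

The nesting of the document already encodes the parent/child relationships, so exploding the full bill of materials or a single subassembly needs only a path, not a recursive join.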
1.5.3 When Data Represents Business Objects Problem with relational data: If application data represents business objects, such as insurance claim forms, then it is often beneficial to keep the data items that comprise a particular claim together, instead of spreading them over a set of tables. This benefit is particularly important when the individual data items of a claim form have no valid business meaning by themselves and can only be interpreted in the context of the complete form. Normalizing the claims across dozens of relational tables means that applications deal with a complex and unnatural fragmentation of their business data. Such normalization can increase complexity and the chance for errors. Solution with XML data: XML enables you to represent even complex business objects as cohesive and distinct documents while still capturing all the relationships between the data items that comprise the business object. Representing each claim form (business object) as a single XML document in a single row of a table provides a very intuitive storage model for the application developer and enables rapid application development.
1.5.4 When Objects Have Sparse Attributes

Problem with relational data: Some applications have a large number of possible attributes, most of which are sparse; that is, they apply to very few objects. A classic example is a product catalog where the number of different product attributes can be huge, including size, color, weight, length, height, material, style, weave, voltage, resolution, water resistance, and a near endless list of other properties. For any given product, only a subset of these attributes is relevant. One possible relational schema is to have one column per attribute, which means a very large percentage of the cells in the table contain NULL values. Large numbers of NULLs are undesirable and can be inefficient. A different relational approach for such sparse data is a three-column table that stores several name/value pairs for each product ID. In this name/value pair approach, the attribute names are not column names but values in a VARCHAR column. This design prevents relational database systems from accurately estimating constraint selectivity and generating efficient query plans. Finally, defining and enforcing constraints, such as uniqueness for a certain attribute, is extremely difficult. Hence, data quality and integrity suffer.

Solution with XML data: The beauty of XML is that elements and attributes can be optional, so they are simply omitted if they don’t apply to a specific product. Neither NULL values nor name/value pairs are needed. The XML Schema can define a very large number of optional elements, but only a few of them are used for any given object. While every row in a relational table has to have the exact same columns, XML documents in an XML column can have different elements from one row to the next. Also, an XML index for an optional element is very small if this element appears only in a small percentage of the documents (rows).
This is a clear advantage over relational indexes which have exactly one entry per row. 1.5.5 When Data Needs to be Exchanged Problem with relational data: If you export a set of rows from a relational table and send them to another application or organization, the recipient cannot interpret the data without additional metadata that describes the columns. This separation of data from metadata in the relational world poses a particular problem if your relational schema has changed since the last time you sent data. Solution with XML data: XML data is self-describing. The XML tags are metadata and describe the values that they enclose. The nesting of XML elements further defines the relationship between data items. 1.6 SUMMARY XML, the extensible markup language, acts as a flexible and self-describing data format for data exchange, web services, and service-oriented architectures. XML is also a hierarchical data model that is inherently different from the relational model. While relational data processing is
based on rigorous and predefined schemas that allow for limited flexibility, XML is well-suited to represent data with variable or evolving schemas. XML is also commonly used as a data format for semi-structured data or to integrate structured and unstructured data. Depending on the performance and flexibility requirements of particular applications, you will find that in some cases XML is a better choice than a relational schema, and in other cases relational data has advantages over XML. Many scenarios also exist in which a hybrid approach, that is, a mix of XML and relational data, is the best solution. Considerations for hybrid data models are discussed further in the next chapter. DB2 pureXML provides sophisticated capabilities for storing, indexing, querying, updating, and validating XML documents. The pureXML technology and its native XML storage format provide significantly higher performance and flexibility than alternative storage options for XML data, such as LOBs or shredding. DB2 pureXML also enables seamless integration of XML and relational data.
CHAPTER 2
Designing XML Data and Applications

This chapter looks at several design issues in the world of XML documents. Sometimes you might get involved in the design of a specific format for your XML documents and you will find that the design decisions made at this point can have a big impact on how your application processes XML. Therefore, this is the first stage of XML application design. In many other cases, the format of the XML documents that you need to process may have already been designed and decided by the time you get involved. Many vertical industries and consortia define specific XML Schemas to standardize the XML document formats that are used to exchange and process information within a particular industry. Some of them are discussed in Chapter 16, Managing XML Schemas. Even if you work with a predefined XML format, there are still decisions to be made, such as the most suitable granularity in which you should store XML documents or document fragments.

In this chapter you learn
• How to choose between XML elements and attributes (section 2.1)
• How to represent data as XML values and metadata as XML tags (section 2.2)
• How to design documents with an appropriate size and scope (section 2.3)
• How to decide on a “good” mix of XML and relational data (section 2.4)

2.1 CHOOSING BETWEEN XML ELEMENTS AND XML ATTRIBUTES

A common question is when to use attributes and when to use elements, and whether this choice affects performance. It turns out that this is much more of a data modeling question than a performance question. As such, this question is as old as SGML, the precursor of XML, and has been
hotly debated with no universally accepted consensus. However, a key thing to remember is that XML elements are more flexible than attributes because they can be repeated and nested.

Table 2.1 shows an example of an XML document with and without attributes. Both documents logically represent the same business data. They contain information about a book called “Database systems”, written by authors “John Doe” and “Peter Pan” who have id numbers 47 and 58 respectively, and the price of the book is 29, but there is no information in either document about the currency of the price.

In the document on the left of Table 2.1, price and title are child elements of the element book, and the author id is a child element of the element author. This approach is certainly a decent way of modeling the data. Alternatively, the document on the right has price and title as attributes of the element book, and id as an attribute of the element author. In general, both versions of the document, with and without attributes, can be reasonable choices. There is no immediate way to decide whether one of the two document formats is “better” than the other.

Table 2.1 An XML Document with and without Attributes

XML document without attributes (left):

    <book>
      <authors>
        <author>
          <id>47</id>
          <name>John Doe</name>
        </author>
        <author>
          <id>58</id>
          <name>Peter Pan</name>
        </author>
      </authors>
      <title>Database systems</title>
      <price>29</price>
      <keywords>
        <keyword>SQL</keyword>
        <keyword>relational</keyword>
      </keywords>
    </book>

XML document with attributes (right):

    <book price="29" title="Database systems">
      <authors>
        <author id="47">John Doe</author>
        <author id="58">Peter Pan</author>
      </authors>
      <keywords>
        <keyword>SQL</keyword>
        <keyword>relational</keyword>
      </keywords>
    </book>

The document with attributes might be appealing because it is shorter. It contains 200 non-whitespace characters as opposed to 248 in the document without attributes.
An XML parser needs to look at every single character of a document, which generally means that shorter documents can be parsed faster. This reduction in parsing times may matter if you are designing an XML message format for very high-volume processing with near real-time performance requirements and throughput targets such as thousands of messages per second. However, many XML applications do not fall into this category and performance should be a secondary concern during XML modeling.
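The character counts quoted for Table 2.1 can be checked with a few lines of Python. This sketch simply transcribes the two documents from the table and counts every character that is not whitespace; the variable and function names are chosen for this example only.

```python
# The two documents from Table 2.1, transcribed as text.
without_attributes = """
<book>
  <authors>
    <author> <id>47</id> <name>John Doe</name> </author>
    <author> <id>58</id> <name>Peter Pan</name> </author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>
"""

with_attributes = """
<book price="29" title="Database systems">
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>
"""


def non_whitespace_length(text):
    """Count the characters that are not spaces, tabs, or line breaks."""
    return sum(1 for ch in text if not ch.isspace())


print(non_whitespace_length(without_attributes))  # 248
print(non_whitespace_length(with_attributes))     # 200
```

The attribute version is about 19% shorter for this particular document; the saving depends entirely on how many short values can be moved into attributes.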
More important is the flexibility and extensibility of the XML format, which is usually why XML is chosen to begin with. In the example in Table 2.1, chances are that the format of the price information eventually needs to be extended. This extension is easy in the document on the left where price is an element. For example, you can add an attribute currency to the price element to make it more descriptive. Also, as the business expands to international markets, you can easily repeat the price element multiple times to reflect the price of the book for different countries (see Figure 2.1).

    <book>
      <authors>
        <author>
          <id>47</id>
          <name>John Doe</name>
        </author>
        <author>
          <id>58</id>
          <name>Peter Pan</name>
        </author>
      </authors>
      <title>Database systems</title>
      <price currency="GBP">29</price>
      <price currency="JPY">5735</price>
      <price currency="EUR">35.80</price>
      <keywords>
        <keyword>SQL</keyword>
        <keyword>relational</keyword>
      </keywords>
    </book>

Figure 2.1 Document with multiple price elements

This extension of the price element has the very desirable property that XPath queries that worked for the old document format continue to work without changes for the new format. For example, the XPath /book/price returns the single price element from the document on the left in Table 2.1, but also all three price elements with their currency information from the new document format in Figure 2.1. This property helps to ensure seamless operation of applications during such a schema evolution.

In the document on the right side of Table 2.1, where price is an attribute, such an extension is a lot harder to make if you want to keep using attributes. The existing price attribute cannot be extended to contain another nested attribute, and an attribute by the name of price can only occur once for the book element. You could certainly remove the existing price attribute and use price elements instead.
This change implies that for older documents the XPath to the price information is /book/@price whereas for newer books it is /book/price. Thus, this change is invasive and indicates that you probably should have used elements to begin with. In such a situation you should not use multiple price attributes with different names, as shown in Figure 2.2. This design has a variety of undesirable consequences. First of all, XPath queries need
to be changed each time you introduce a new currency to your business. Second, this design makes it more complicated to retrieve all price information with a single query. Third, if your queries use search conditions on the price attributes then you will have to define a separate XML index for each currency, instead of just two indexes (one for price and one for currency). These problems stem from the fact that the currency information is part of your business data, not part of the metadata. Hence, the currency should be a value and not part of a tag name. The use of tags and values is discussed further in section 2.2.

    <book priceGBP="29" priceJPY="5735" priceEUR="35.80" title="Database systems">
      ...
    </book>

Figure 2.2 Bad XML design with different names for price attributes

Also note that the XML standard specifies that elements are ordered while attributes are unordered. For example, the three price elements in Figure 2.1 are in a fixed order, and this order is guaranteed when the document is parsed, stored, queried, or otherwise processed. In contrast, the three price attributes in Figure 2.2 do not have a significant order within the book element. They could appear in a different order and the document would still be considered “the same.” Hence, if the relative order among your data items is important, use elements instead of attributes.

Although you could model all data without attributes, they can be a very intuitive choice for data items that are known in advance to never repeat (per element) nor have any subfields. Attributes contribute to somewhat shorter XML because they have only a single tag as opposed to elements, which have a start tag and an end tag. Shorter attribute tags are at most a minor performance bonus rather than an incentive to convert elements to attributes, especially when data modeling considerations actually call for elements.
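The difference between repeating price elements and inventing one attribute name per currency can be sketched in Python. The documents below are abbreviated versions of the book examples, the helper functions are hypothetical, and ElementTree evaluates paths relative to the root element, so "./price" on the book element stands in for the XPath /book/price.

```python
import xml.etree.ElementTree as ET

# Abbreviated versions of the old format (Table 2.1, left) and the
# extended format (Figure 2.1).
old_doc = ET.fromstring(
    "<book><title>Database systems</title><price>29</price></book>")
new_doc = ET.fromstring(
    "<book><title>Database systems</title>"
    '<price currency="GBP">29</price>'
    '<price currency="JPY">5735</price>'
    '<price currency="EUR">35.80</price></book>')

# The design that the text advises against: one attribute name per currency.
bad_doc = ET.fromstring(
    '<book priceGBP="29" priceJPY="5735" priceEUR="35.80" '
    'title="Database systems"/>')


def prices(book):
    """The same path works before and after the price element is extended."""
    return [(p.get("currency"), p.text) for p in book.findall("./price")]


def prices_from_attribute_names(book):
    """With the currency hidden in attribute names, the code must scan all
    attribute names and cut the currency out of each matching name."""
    return sorted((name[len("price"):], value)
                  for name, value in book.attrib.items()
                  if name.startswith("price"))


print(prices(old_doc))
print(prices(new_doc))
print(prices_from_attribute_names(bad_doc))
```

The result of the second helper is sorted because attributes carry no significant order, whereas the list returned by prices preserves the document order of the price elements.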
In DB2, attributes can be used in queries, updates, predicates, and index definitions just as easily as elements. There is generally no significant performance difference between accessing or updating elements versus attributes when XML documents are stored in DB2. Both elements and attributes can be defined as mandatory or optional in an XML Schema. As another example, let’s look at the XML document in Figure 2.3, which contains information about a department with two employees. The document uses attributes for the department and employee identifiers. This approach seems to make sense because each employee and department will always have just one ID value. Furthermore, an element is used for the employee telephone information, which allows an employee to have multiple occurrences of the phone element if needed. It is also extensible in case you later need to break telephone numbers into fragments. For example, the phone element could have child elements for country code, area code, and extension, which would not be possible if phone was an attribute.
    <dept deptID='PR27'>
      <employee id='58043'>
        <name>John Doe</name>
        <phone>408-555-1212</phone>
        <phone>408-463-4880</phone>
      </employee>
      <employee id='81822'>
        <name>Peter Pan</name>
        <phone>408-255-8587</phone>
        <office>F589</office>
      </employee>
    </dept>

Figure 2.3 A sample XML document

The XML document in Figure 2.3 also raises another design question, which we discuss in section 2.3: Is it better to keep the information for all employees of a department in one document, or is it better to have one XML document per employee?

2.2 XML TAGS VERSUS VALUES

The idea of XML as an extensible markup language is that the markup, which consists of all the element and attribute tags, describes the enclosed data values. The ability to use custom tags for markup makes XML a self-describing data format. The XML tags can also be considered metadata. Hence, XML documents conveniently combine data and metadata in a universally accepted format.

An important aspect of designing XML documents is to distinguish clearly between data and metadata. The metadata should be represented as element and attribute names, the data as element and attribute values. This approach is analogous to relational modeling, where table and column names are metadata, and the values in the columns are the actual data. In XML it’s almost always a bad idea to represent metadata as values instead of tags, or actual data as tags instead of values. Let’s look at the examples in Table 2.2 and Table 2.3.

The document on the left side of Table 2.2 contains information about the brand, price, and year of a car. The brand is Honda, the price is 5000, and the year is 1996. The terms “brand”, “price”, and “year” constitute meta information for the values Honda, 5000, and 1996. Hence, Honda is a data value, not metadata. Therefore it should be an XML element value, not an element name. The XML document on the right side of Table 2.2 is a better representation of the same data.
There the term “brand” is used as an element name (meta information) for the value Honda. Imagine yourself modeling the same data in a relational table. You would not use Honda as a column name in a table. Avoiding business data in tag names has several advantages: • If you are using an XML Schema, you don’t need to add new element definitions to your XML Schema each time your business handles a new brand of car.
• You can always use the XPath /car/brand to retrieve the brand from a particular car document. Otherwise, if brand names are tags, many different or more complicated XPath expressions are necessary.

• If you search for cars by brand then you can use XML indexes in a much simpler and more intuitive manner if the brand names are element or attribute values rather than tag names.

Table 2.2 Business Data as Tags Versus Values

Business data as element name (not recommended) (left):

    <car>
      <Honda>
        <price>5000</price>
        <year>1996</year>
      </Honda>
    </car>

Business data as element value (recommended) (right):

    <car>
      <brand>Honda</brand>
      <price>5000</price>
      <year>1996</year>
    </car>

What happens if you use meta information, such as the terms “brand”, “price”, and “year”, as values rather than element or attribute names? This is shown on the left side of Table 2.3 where the XML document consists of very generic tag names, such as object, type, field, name, and value. These tags are not very descriptive, which is contrary to the concept of XML as a self-describing data format. You see that the brand, price, and year of the car are represented by pairs, which consist of a name and a value. However, the names are actually XML attribute values, not descriptive tag names. This approach is commonly referred to as Name/Value Pairs (NVP), Key-Value Pairs (KVP), or Entity-Attribute-Value model (EAV).

Table 2.3 Name/Value Pairs (Metadata as Tags Versus Values)

Metadata as values, aka Name/Value Pairs (often bad) (left):

    <object type="car">
      <field name="brand" value="Honda"/>
      <field name="price" value="5000"/>
      <field name="year" value="1996"/>
    </object>

Metadata as element names (good) (right):

    <car>
      <brand>Honda</brand>
      <price>5000</price>
      <year>1996</year>
    </car>

The Name/Value Pair approach to data modeling also sometimes appears in the relational world when a table with three columns (id, name, and value) is used.
This approach may seem attractive when dealing with entities that can have hundreds or thousands of attributes, but only a small number of them apply to any individual entity. If you were to represent each possible attribute by a column in a relational table, you might exceed the maximum row length or the maximum number of columns in a table. Nevertheless, the Name/Value Pairs approach has very significant and inherent drawbacks, which are similar for XML and relational data. In particular:
• Defining business rules and constraints for Name/Value Pairs is very difficult and often impossible. You cannot define an effective XML Schema to control and constrain this type of XML data. If you use the “better” XML format shown on the right side of Table 2.3, an XML Schema can easily specify that the value of the price element has to be greater than zero, and the value of the year element has to be a four-digit integer between 1950 and 2099. In the Name/Value Pairs in the left column of Table 2.3, price and year are represented by the same XML attribute called value. An XML Schema does not allow you to specify that if there is an attribute called name with the value price then the value of the attribute value in the same field element must be greater than zero.

• Name/Value Pairs handle all data as strings (text). Since the attribute value can contain arbitrary data values, it cannot be typed as INTEGER, DECIMAL, DATE, or TIMESTAMP. Handling all data as strings leads to data quality issues because proper data types cannot be enforced. Another consequence is that any indexes and comparisons have to treat the data values as strings. If you search for cars with a price greater than “5000”, you will also find cars with prices such as “600” or “900” because these strings are greater than the string “5000”. You can solve this problem with appropriate cast operations in your queries, but those usually preclude the use of indexes, which means performance suffers.

• Writing queries against Name/Value Pair data is very complex. As an example, assume that you need to retrieve the years of all Honda cars that have a price greater than 5000. The corresponding XPath expression for the Name/Value Pair data is shown in Figure 2.4, followed by the same query for the “regular” XML data on the right side of Table 2.3. The difference in complexity is striking, and it is even greater for more advanced search queries.
    -- XPath query to retrieve the years of all Honda cars with a
    -- price greater than 5000 from Name/Value Pair XML data:

    /object[@type="car" and
            field[@name = "brand" and @value = "Honda"] and
            field[@name = "price" and @value > "5000"]
           ]/field[@name="year"]/data(@value)

    -- Same query for regular XML Data:

    /car[brand="Honda" and price > 5000]/year

Figure 2.4 Complexity when querying Name/Value Pairs
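The string-comparison pitfall and the extra query effort described above can be reproduced outside the database. The following Python sketch uses a hypothetical Honda priced at 600 in both document styles. ElementTree does not support value comparisons inside path predicates, so the comparisons are written in Python; the helper names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The same hypothetical car (price 600) in both document styles.
nvp_car = ET.fromstring(
    '<object type="car">'
    '<field name="brand" value="Honda"/>'
    '<field name="price" value="600"/>'
    '<field name="year" value="1996"/>'
    '</object>')
regular_car = ET.fromstring(
    "<car><brand>Honda</brand><price>600</price><year>1996</year></car>")

# Compared as strings, "600" sorts after "5000" because '6' > '5'.
string_compare_result = "600" > "5000"


def nvp_value(obj, name):
    """Look up the value attribute of the field with the given name."""
    field = obj.find(f"./field[@name='{name}']")
    return None if field is None else field.get("value")


def honda_years_nvp(obj, compare_as_strings):
    """Years of Honda cars with price > 5000 from Name/Value Pair data."""
    if obj.get("type") != "car" or nvp_value(obj, "brand") != "Honda":
        return []
    price = nvp_value(obj, "price")
    matches = price > "5000" if compare_as_strings else float(price) > 5000
    return [nvp_value(obj, "year")] if matches else []


def honda_years_regular(car):
    """The same question against the regular document format."""
    if car.findtext("brand") == "Honda" and float(car.findtext("price")) > 5000:
        return [car.findtext("year")]
    return []


print(string_compare_result)
print(honda_years_nvp(nvp_car, compare_as_strings=True))   # wrong match
print(honda_years_nvp(nvp_car, compare_as_strings=False))  # cast fixes it
print(honda_years_regular(regular_car))
```

With string semantics the car priced at 600 is returned even though it does not satisfy the intended numeric condition; only the explicit cast gives the correct empty result.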
2.3 CHOOSING THE RIGHT DOCUMENT GRANULARITY

When you design your XML application, and in particular your XML document structure, you may have a choice as to which business data is kept together in a single XML document. Is it better to keep a lot of data in a large XML document, or is it better to use many small documents instead? The proper scope of any given document is a critical design decision.

The general recommendation is to choose an XML document granularity such that one document represents one logical business object from an application point of view. Another guideline is to use an XML document granularity that matches the anticipated predominant granularity of data access or data exchange. Very often the logical business objects match the predominant granularity of data access, so these two guidelines lead to the same result.

What constitutes a small, medium, or large XML document? Very roughly, XML documents up to 50KB are typically considered small, documents between 50KB and 1MB are often considered medium, and documents of more than 1MB are considered large. Documents in the range of hundreds of megabytes or a few gigabytes are huge, relatively rare, and almost always the result of combining a large number of smaller XML documents.

Let’s look at the example in Figure 2.5, which shows three design options to represent data for several orders. Each order has a date, a customer name, and several parts, which have a key, a quantity, and a price. Let’s assume that you have to store and manage these orders for a particular application that treats each individual order as a separate logical business object. It typically receives and processes one order at a time, and a single order is the predominant level of access or transmission.

In case (a) on the left, multiple orders are combined in one large document (coarse granularity).
This approach can be useful when you need to archive or FTP a certain batch of orders, such as all orders for the past week. Storing this large document as-is in a database is only a good idea if this batch in its entirety represents a meaningful business object to your application and users. This is not the case in our example. Since our fictitious application typically reads and writes one order at a time, storing many orders in a single large document would result in suboptimal performance. In general, combining many independent business objects in a single document is not recommended. DB2 uses indexes over XML data to filter on a per-document level. Therefore, the finer the XML document granularity, the higher the potential benefit from index-based access. Although DB2 pureXML helps you avoid a lot of XML parsing in the application layer, some applications might still use a DOM parser to ingest XML documents and run into performance problems or failures if the documents are too large. Many XML design and editing tools also use DOM parsers and are often unable to handle very large XML documents. Therefore, debugging and correcting XML documents is much easier if they are small. In case (b), each order is a separate XML document (medium granularity). This approach matches the nature of the application and not only provides good performance but is also very
2.3 Choosing the Right Document Granularity 23 intuitive for the application developer. One row in the database contains one business object for the application and no joins are required to retrieve all data for this object. Case (c) on the right represents fine granularity. Each order and each part is stored as a separate XML document. This approach can be a very good choice if each part information in itself is a separate business object of interest and often accessed and processed independently from the order it belongs to. In this example, however, part information has no real business meaning on its own and is dependent on an order. For example, the quantity and the price of a part are relevant only for a specific order. A different order can contain the same part with a different price and quantity. Typically, an application always needs to see all parts of an order and would never retrieve a part by itself without order information. Another reason why case (c) might not be useful is that having part and order information in separate documents would require joins between them. These reasons make case (b) desirable because the XML documents already represent this join in their structure. 
(a)
<allorders>
  <order date='2004-11-05'>
    <customer>Doe</customer>
    <part key='82'>
      <quantity>5</quantity>
      <price>5.00</price>
    </part>
    <part key='83'>
      <quantity>11</quantity>
      <price>19.95</price>
    </part>
  </order>
  <order date='2004-11-06'>
    <customer>Doe</customer>
    <part key='19'>
      <quantity>23</quantity>
      <price>1.99</price>
    </part>
    <part key='83'>
      <quantity>1</quantity>
      <price>24.95</price>
    </part>
  </order>
</allorders>

(b)
<order date='2004-11-05'>
  <customer>Doe</customer>
  <part key='82'>
    <quantity>5</quantity>
    <price>5.00</price>
  </part>
  <part key='83'>
    <quantity>11</quantity>
    <price>19.95</price>
  </part>
</order>

<order date='2004-11-06'>
  <customer>Doe</customer>
  <part key='19'>
    <quantity>23</quantity>
    <price>1.99</price>
  </part>
  <part key='83'>
    <quantity>1</quantity>
    <price>24.95</price>
  </part>
</order>

(c)
<order date='2004-11-05'>
  <customer>Doe</customer>
  <part key='82'/>
  <part key='83'/>
</order>

<order date='2004-11-06'>
  <customer>Doe</customer>
  <part key='19'/>
  <part key='83'/>
</order>

<part key='82'>
  <quantity>5</quantity>
  <price>5.00</price>
</part>

<part key='83'>
  <quantity>11</quantity>
  <price>19.95</price>
</part>

<part key='19'>
  <quantity>23</quantity>
  <price>1.99</price>
</part>

<part key='83'>
  <quantity>1</quantity>
  <price>24.95</price>
</part>

Figure 2.5 Different document granularities
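Design option (b) maps naturally onto one row per order. The following is a minimal sketch rather than something taken from the original text: the table name orders and the column names ordid and orddoc are assumptions chosen for illustration. It stores the two orders of Figure 2.5 as separate documents and then retrieves the orders that contain part 83.

```sql
-- Hypothetical table: one row, and one XML document, per order (case b).
CREATE TABLE orders (ordid INTEGER NOT NULL PRIMARY KEY, orddoc XML);

INSERT INTO orders (ordid, orddoc) VALUES (1,
  '<order date="2004-11-05"><customer>Doe</customer>
     <part key="82"><quantity>5</quantity><price>5.00</price></part>
     <part key="83"><quantity>11</quantity><price>19.95</price></part>
   </order>');

INSERT INTO orders (ordid, orddoc) VALUES (2,
  '<order date="2004-11-06"><customer>Doe</customer>
     <part key="19"><quantity>23</quantity><price>1.99</price></part>
     <part key="83"><quantity>1</quantity><price>24.95</price></part>
   </order>');

-- Return whole orders that contain part 83. No join is needed because
-- each document already holds the order together with its parts.
SELECT ordid, orddoc
FROM orders
WHERE XMLEXISTS('$d/order/part[@key = "83"]' PASSING orddoc AS "d");
```

With design (a), the same predicate could only select or reject the single large document as a whole, which is why finer granularity tends to benefit more from per-document filtering.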
In a nutshell, choose the XML document granularity with respect to the logical business objects and the anticipated predominant granularity of access. When in doubt, it is usually better to lean towards finer granularity and smaller XML documents.

2.4 USING A HYBRID XML/RELATIONAL APPROACH

XML is not the grand solution for all data management problems. As discussed in Chapter 1, Introduction, XML can provide significant advantages if the structure of your data is highly variable, evolves over time, or is hard to represent in a simple relational schema. Also, if you receive and send business objects in XML format, you can often improve performance and simplify applications if you also store these objects in XML format. Storing XML objects in XML format avoids complex mappings and costly transformations. However, sometimes the best solution is to store some of your data in relational format and some of your data in XML format, which is called hybrid storage.

There are no definitive rules that describe precisely how to determine the right mix of XML and relational data. The right mix depends on the specific characteristics and requirements of a given application, or set of applications, that access the data. The following considerations can help you find the right design for your application.

It is quite common that business objects such as orders, trades, sales records, customer records, emails, and blog posts consist of a fixed header plus a highly variable body. The header contains certain data fields that are common for all business objects of the same category. The body can be very different from one business object to the next and can contain any of thousands of optional attributes. For example, a financial trade might contain a header with the trade ID, the trading date, and the IDs of the two parties involved in the trade.
Although these data items are present for every trade, the elements in the body of the trade depend highly on the exact nature of this particular trade. In this case, you might want to store the header fields in relational columns and the body in an XML column of the same table. Similarly, think of XML documents such as emails, blog posts, or CRM (customer relationship management) records produced in a call center. CRM records often contain the customer name and identifier, the date when the customer called in to report a problem, the name or ID of the product or service that the customer needs help with, and most likely a unique identifier of the CRM record itself. This data is very regular and structured with well-defined data types and can easily be stored in relational columns. However, the body of a CRM record typically contains semi-structured information with free text as well as interspersed data fields such as dates and a user ID to track when and by whom new information gets appended. This semi-structured part of the CRM record is better stored as a whole in an XML column.
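The trade example can be sketched as a hybrid table. This is an illustration under stated assumptions, not a design from the original text: the table name trade, all column names, and the XML names trade, instrument, and type are hypothetical.

```sql
-- Hypothetical hybrid table: fixed header fields as relational columns,
-- the highly variable body as an XML column of the same table.
CREATE TABLE trade (
  trade_id   BIGINT      NOT NULL PRIMARY KEY,
  trade_date DATE        NOT NULL,
  party1_id  VARCHAR(20) NOT NULL,
  party2_id  VARCHAR(20) NOT NULL,
  body       XML
);

-- A relational predicate filters on the header, while an XML predicate
-- inspects the body (the element and attribute names are assumed).
SELECT trade_id, body
FROM trade
WHERE trade_date = '2009-06-30'
  AND XMLEXISTS('$b/trade/instrument[@type = "swap"]' PASSING body AS "b");
```

The header columns can carry constraints and regular relational indexes, while the body can vary freely from one trade to the next.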
If a business object arrives as an XML document, DB2 can extract selected element or attribute values from the document as part of the INSERT statement, without any extra XML parsing. This process is explained in more detail in Chapter 11, Converting XML to Relational Data.

The benefits of storing some data fields of a business object in relational format can include the following:

• You can define primary key and foreign key constraints on relational columns, but not on any elements or attributes in an XML column.
• You can define multi-column (composite key) indexes on two or more relational columns, but you cannot define a composite key on two or more elements or attributes in an XML column.
• Relational columns can be used to define range partitioning, hash partitioning, or multidimensional clustering for a table. These cannot be defined based on elements or attributes in an XML column.
• Queries can use regular relational SQL predicates for relational columns, which some people find easier to use than XML predicates.
• If you use WebSphere Replication Server to replicate rows to another database server, you can define filtering conditions on relational columns of the source so that rows are selectively replicated only if they meet the specified condition. Such replication filters cannot be specified on XML columns.
• Relational column values can be referenced in the definition of generated columns and materialized views, but XML columns and individual XML elements and attributes cannot.

2.5 SUMMARY

Designing an XML application begins with designing the XML data. The more appropriately you design your XML data for your business needs and application, the easier it will be to process and manage this XML data efficiently.
Both your applications and your database will run best if the scope and granularity of your XML documents match the logical business objects of your application as well as the most frequent granularity of data access or data exchange. Try to favor smaller documents rather than larger documents. For the low-level design of your XML documents, keep in mind that XML elements are more flexible than attributes because they can be repeated and nested. You often want to favor XML elements over attributes to ensure future extensibility of your XML data. Also, make sure that meta information that describes your data is represented by XML element and attribute names, not by values. Conversely, the actual data items that your applications need to read and manipulate should be XML element and attribute values, not XML tags. Remember the analogy to the columns in relational tables, where column names represent metadata while the column content is your business data.
Often you do not have the luxury to design your XML document format. Many XML applications are forced to consume and process XML documents in a format that has previously been designed by other parties and cannot be changed. You can still choose to let DB2 split those documents into smaller fragments if that better matches the predominant granularity of access. Additionally, it can be advantageous to extract a few selected elements or attribute values from each document into relational columns. Chapter 5, Moving XML Data, and Chapter 11, Converting XML to Relational Data, explain DB2’s capabilities for splitting XML documents and hybrid XML/relational storage.
CHAPTER 3 Designing and Managing XML Storage Objects

In this chapter we discuss how to create and configure a database, table spaces, and tables to manage XML data. This discussion includes topics such as hierarchical XML storage structures, XML compression and inlining, monitoring and measuring XML storage consumption, reorganization, and partitioning of tables and databases that contain XML data. The topics in this chapter are organized as follows:

• Understanding XML document trees and their pureXML storage representation. These concepts are platform independent (sections 3.1 and 3.2)
• Managing XML storage in DB2 for Linux, UNIX, and Windows (sections 3.3 through 3.10)
• Managing XML storage in DB2 for z/OS (sections 3.11 and 3.12)
• XML parsing and XML memory options specific to DB2 for z/OS (section 3.13)

When you create a database that will contain XML data, one of the first design choices is to choose a code page. The recommended code page is UTF-8 Unicode. The benefits of Unicode are explained in Chapter 20, Understanding XML Data Encoding. It is also possible to manage XML in a non-Unicode database, which allows you to easily add XML to existing databases that do not use UTF-8. DB2 9 for z/OS allows XML columns in databases and table spaces of any supported encoding. In DB2 9.5 and 9.7 for Linux, UNIX, and Windows, all new databases use UTF-8 as the default code page. However, you can specify a non-Unicode code page in the CREATE DATABASE statement, if you want.
DB2 9.1 for Linux, UNIX, and Windows is slightly more restrictive because pureXML is available only in UTF-8 encoded databases, and you must explicitly set the database code page to UTF-8 in the USING CODESET clause of the CREATE DATABASE statement:

CREATE DATABASE mydb USING CODESET utf-8 TERRITORY us

Before we discuss how XML documents are physically stored in a DB2 database, let’s look at how the XQuery Data Model defines XML document trees.

3.1 UNDERSTANDING XML DOCUMENT TREES

Since XML is a hierarchical data model, every XML document can be represented as a tree of nodes. Any query or update of XML data traverses the hierarchical structure of the XML documents. This traversal can be done most efficiently if the XML documents are physically stored in a hierarchical format. Therefore, DB2 for z/OS and DB2 for Linux, UNIX, and Windows store XML documents as trees of nodes with parent-child relationships between the nodes. These trees are defined by the XQuery Data Model (XDM) and described in this section. Further details of the XQuery Data Model are covered in Chapter 6, Querying XML Data: Introduction and XPath.

Let’s look at the XML document in Figure 3.1 as an example. It is a simple document that contains information about a customer. The outermost element, customerinfo, is called the root element. Its children are the elements name and addr as well as two occurrences of the element phone. The element addr has an attribute country as well as four child elements: street, city, state, and zip. Each phone element has an attribute called type.

<customerinfo>
  <name>Jim Noodle</name>
  <addr country="US">
    <street>555 Bailey Ave</street>
    <city>San Jose</city>
    <state>CA</state>
    <zip>95141</zip>
  </addr>
  <phone type="work">408-289-4136</phone>
  <phone type="cell">408-710-7910</phone>
</customerinfo>

Figure 3.1 Sample XML document

Figure 3.2 shows the same XML document in its tree representation.
Such a tree can be constructed by parsing a textual XML document with an XML parser. In general, an XML document tree can have six different types of nodes. Element nodes, attribute nodes, text nodes, and the document node are the most common node kinds. They occur in the tree in Figure 3.2. Occasionally, XML documents can also contain comment nodes and processing instruction nodes.
Every XML element of the document in Figure 3.1 is represented by an element node in the corresponding document tree in Figure 3.2. The element nodes are white and rectangular. The textual value of each element is represented by a separate text node, shown in gray. Attribute nodes are shown with a double border. An attribute node contains all information about an attribute, including its value.

The XQuery Data Model also defines that each document tree has a document node, shown in Figure 3.2 as a black circle. It is the topmost node and the parent of the root element. The document node is not visible in the textual representation of an XML document, only in its parsed hierarchical format. You will see later in this book that the document node is sometimes important when you manipulate XML documents. For example, assume you cut off the addr branch from the tree in Figure 3.2. This branch by itself does not have a document node and is therefore not a valid document tree. Hence, inserting it as a document into an XML column would fail unless you construct a new document node. Construction of a document node is shown in Chapter 5 (see section 5.7, Splitting Large XML Documents into Smaller Documents).

Figure 3.2 XML document tree (tree diagram of the document in Figure 3.1)

You might wonder why element values reside in separate text nodes while attribute values do not. The main reason is that the child nodes of an element can be a mix of text nodes and other element nodes, which is known as mixed content. An attribute, however, has exactly one value and never any child nodes, which makes attributes less extensible than elements. An element can have multiple text node children but they cannot be adjacent siblings to each other.
As an example of mixed content and multiple text node children, consider the following two XML documents, both of which contain a title element. In the first case the title has a single text value and the corresponding tree representation is shown in Figure 3.3(a). The title element in the second document contains some text, “The ” and “ Cookbook” (note the spaces!), as well as a child element bold.
Figure 3.3(b) shows that this results in a mixed set of child nodes under the title element: two text nodes and one child element (bold). The two text nodes “The ” and “ Cookbook” are separated by the element bold and are not adjacent children. If they were adjacent they would automatically collapse into a single text node.

(a) <title>The DB2 pureXML Cookbook</title>
(b) <title>The <bold>DB2 pureXML</bold> Cookbook</title>

Figure 3.3 An example of mixed content (tree diagrams of documents (a) and (b))

Note that the XQuery Data Model defines the value of an XML element as the concatenation of all text nodes in the subtree under that element. This concatenation is trivial for elements that have only one text node. The value of the element state in Figure 3.2 is “CA”, and the value of title in Figure 3.3(a) is “The DB2 pureXML Cookbook”. At the same time, the value of the title element in Figure 3.3(b) is also “The DB2 pureXML Cookbook”, and the value of the element bold is “DB2 pureXML”. Similarly, the value of the addr element in Figure 3.2 is “555 Bailey AveSan JoseCA95141” (note that there is no space between Ave and San, between Jose and CA, or between CA and 95141). The addr element is called a non-leaf element, and this example shows that values of non-leaf elements are often not useful.

3.2 UNDERSTANDING PUREXML STORAGE

The document tree in Figure 3.2 illustrates the hierarchical format in which XML documents are stored in DB2 (all platforms). When an XML document in its textual format is inserted or loaded into an XML column, the DB2 server parses the XML document to produce the parsed hierarchical format that is stored on pages in a table space. This process is reversed when an application retrieves an XML document from DB2. This reverse process is called serialization; that is, the document tree is converted back into the text format of the XML document.
You can think of parsing and serialization as inverse operations.
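Both directions can be made explicit in SQL. The sketch below assumes a table such as customer (id INTEGER, info XML), which this chapter introduces in section 3.3.1, and uses a shortened version of the document in Figure 3.1. The id value 1000 and the CLOB length are arbitrary choices for illustration.

```sql
-- Parsing: the textual document is converted into the hierarchical
-- format that DB2 stores.
INSERT INTO customer (id, info)
VALUES (1000, XMLPARSE(DOCUMENT
  '<customerinfo><name>Jim Noodle</name><phone type="work">408-289-4136</phone></customerinfo>'));

-- Serialization: the stored tree is converted back into text.
SELECT XMLSERIALIZE(info AS CLOB(1M)) AS info_text
FROM customer
WHERE id = 1000;
```

Applications rarely need to spell out either function, because DB2 parses and serializes implicitly when XML values are sent to or fetched from the server. The explicit forms are shown here only to make the two steps visible.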
The exact shape of a document tree in DB2’s storage layer depends on and can vary with each individual instance document. It is not pre-defined based on an XML Schema, which allows DB2 to store documents with widely varying or evolving structures in the same XML column.

DB2 performs a variety of optimizations when storing document trees on pages. For example, element and attribute names (also called tag names) are transparently replaced by unique 4-byte integer numbers. Thus, DB2’s internal tree format actually looks more like Figure 3.4 than Figure 3.2. In addition to the integer number, each node can also contain other properties, such as information about namespaces and data types.

Figure 3.4 XML document tree with tag names replaced by integer values (the tree of Figure 3.2 with each tag name replaced by its stringID from Figure 3.5)

The mapping from tag names to the so-called stringIDs is kept in the catalog table sysibm.sysxmlstrings (see Figure 3.5). This mapping is database-wide, where each distinct tag name and each distinct namespace URI has exactly one entry. For example, the phone element occurs twice in the sample document and may occur millions of times across all the XML documents in a database. Each occurrence is replaced with the same unique stringID, which is 116 in this example. Hence, the phone element has only one entry in the mapping table. Consequently, the mapping table never has more entries than there are distinct tag names and namespace URIs in the database, which is typically a small number (several hundred to several thousand).
STRING         STRINGID   IS_TEMPORARY
customerinfo   100        N
name           101        N
addr           102        N
country        103        N
street         104        N
city           109        N
state          106        N
zip            113        N
phone          116        N
type           110        N
…              …          …

Figure 3.5 Mapping tag names to integers in sysibm.sysxmlstrings

When a document is inserted and parsed, DB2 checks every tag name to see whether it is already recorded in this mapping table. If it is not, a new entry is added to the mapping table. Otherwise the existing stringID for the tag is used. Hence, inserts into the mapping table are very rare and occur only for new elements that DB2 has never seen before in a given database. For example, if you insert a million documents of similar structure, there is a good chance that only the first document, or the first few documents, actually cause inserts into the sysibm.sysxmlstrings catalog table. Most of the time the mapping table is active as a lookup table and DB2 has a special purpose mechanism and cache to ensure high lookup performance.

DB2’s use of the mapping table leads to significant performance benefits. First of all, it reduces the space that is required to represent XML on pages in table spaces or buffer pools. Second, any query evaluation and traversal of XML documents now operate on integers, not on strings, which is much faster.

Since the sysibm.sysxmlstrings table never grows very large, DB2 never deletes or updates any entries in this table. This avoids lock contention on this table and enables high performance. Even REORG or LOAD REPLACE of a user table does not reset the mapping table. Remember that the mapping table contains entries for XML documents in the entire database, and not just for XML documents in a single table. Excessive growth of the mapping table is not a concern, because XML applications do not use an unbounded number of distinct tag names.
The mapping table is really only for DB2’s internal operation and you cannot modify it. You can, however, read this table if you want to get a list of all tag names that ever existed in the database (Figure 3.6). Since version 9.5, DB2 for Linux, UNIX, and Windows stores the tags in a binary format to avoid code page problems in non-Unicode databases. Therefore you need to use the function xmlbit2char to make the strings human-readable.

-- DB2 for z/OS and DB2 9.1 for Linux, UNIX, and Windows:
SELECT * FROM sysibm.sysxmlstrings;

-- DB2 for Linux, UNIX, and Windows, Version 9.5 and higher:
SELECT stringid,
       substr(sysibm.xmlbit2char(string),1,50),
       is_temporary
FROM sysibm.sysxmlstrings;

Figure 3.6 Reading XML tag names from sysibm.sysxmlstrings

The column IS_TEMPORARY in sysibm.sysxmlstrings only exists in DB2 for Linux, UNIX, and Windows. It indicates whether a tag name belongs to a document that is stored in an XML column (IS_TEMPORARY = 'N') or to an element or attribute that has been newly constructed as part of a query (IS_TEMPORARY = 'Y'). For example, a query that creates and returns a new element name that has never been seen in the database before also causes a new entry in the string table. However, this happens only upon its very first execution, after which the new tag is registered and known. You cannot delete or update entries in this catalog table.

3.3 XML STORAGE IN DB2 FOR LINUX, UNIX, AND WINDOWS

This and the following sections describe storage objects, such as tables and table spaces, for XML data in DB2 for Linux, UNIX, and Windows. DB2 for z/OS uses similar but slightly different concepts, which are discussed in sections 3.11 and 3.12.

3.3.1 Storage Objects for XML Data

Whenever you define a table, DB2 creates one or multiple storage objects in a table space. For example, a relational table structure is stored in a DAT (data) object. Any kind of index is stored as an INX object.
If your table contains a LOB column, DB2 creates a separate LOB object. And, if your table contains one or multiple XML columns, there is an XDA (XML data area) object. For SMS (system-managed space) table spaces, these objects appear as separate files in the file system. For DMS (database-managed space) table spaces, which are the default and recommended, these objects are not visible but nevertheless exist in the DMS containers.
WHAT IS A TABLE SPACE?
A table space is a storage structure that can contain relational tables and indexes as well as large objects (LOBs) and XML data. Table spaces enable you to specify where your data is physically stored. They also allow you to assign different types of data to different buffer pools in main memory, or to back up and restore specific parts of your database.

Let’s look at the following CREATE TABLE statement as an example. Note that no XML Schema is required to define a table with a column of type XML. DB2’s XML storage is independent of any particular XML Schema.

CREATE TABLE customer (id INTEGER, info XML)

The storage objects that DB2 creates and maintains for this table are illustrated in Figure 3.7. The table with two columns is maintained in a DAT object. The XML column in this table does not contain the actual XML documents that are inserted, but just logical pointers to them. The reason is that XML documents can easily be too big to fit into a relational row on a single page. This approach is similar to the storage of large objects (LOBs) in DB2. The main difference between XML and LOBs is that XML is buffered in the buffer pool whereas LOBs are not.

By default, XML documents are stored in the XDA object. If a table has multiple XML columns, all of them share the same XDA object. Whenever a document tree does not fit on a single page, DB2 automatically and transparently breaks the tree into multiple subtrees, which are called regions. Each region is then stored on a separate XDA page so a single document can span many pages. Documents that fit on a single page consist of a single region. If documents are much smaller than the page size, multiple regions (documents) can be stored on a single page so that no space is wasted. DB2 allows you to store XML documents up to 2GB in size, which is large enough for just about every application.
One regions index is created automatically by DB2 for each table that contains one or more XML columns. In the catalog view syscat.indexes, every regions index is identified by the value XRGN in the column INDEXTYPE. It is not a user-defined index and you cannot drop it. The regions index contains one entry for each region of a document. If a document consists of multiple regions, then these regions are represented by consecutive regions index entries. An XML document pointer in the XML column in the DAT object points to a regions index entry that in turn points to the “first” region of the corresponding document. This is the region that contains the root node of the document. A short range scan on the regions index then provides pointers to the remaining regions of the document. If a node A in a region has a child node B that is the topmost node of
another region, node A contains information that points back into the regions index (not shown in Figure 3.7). It points to the regions index entry that leads to the region with node B.

Also not shown in Figure 3.7 is that DB2 maintains a path index for every XML column. It contains one entry per unique path in the XML data and is therefore very small. More details on the path index can be found in Chapter 13, Defining and Using XML Indexes.

Figure 3.7 Storage objects involved with an XML column (diagram of one table space that holds the DAT object with the columns ID (INT) and INFO (XML), the INX object with the regions index pages, and the XDA object with its pages)

Storing large documents as regions across pages has several advantages. First and foremost, DB2’s proven infrastructure for managing pages works for XML data just like for relational data. This includes table spaces, buffer pools, page cleaning, backup and restore, recovery, HADR, and so on. If a document is large and spans many XDA pages and a query touches only part of the document, DB2 does not necessarily need to bring all pages of the document into the buffer pool.

DB2 always strives to split a document into the smallest possible number of regions. The regions for one document are in most cases stored on physically consecutive pages. The way XML documents are broken into regions is completely transparent to the application and to the DBA. You should never attempt to design XML documents with the goal of optimizing any aspect of how DB2 stores the documents. You should model your XML data at the logical level to reflect your business data and focus on the characteristics and requirements of your application, not on how DB2 processes XML. Most applications are best served with large numbers of small documents, where each XML document represents a separate business object.
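The system-generated indexes described in this section can be listed from the catalog in DB2 for Linux, UNIX, and Windows. The following sketch assumes the customer table defined above was created in the current schema. The INDEXTYPE value XRGN for the regions index is stated in the text, whereas XPTH for the path index is an assumption based on the catalog documentation and should be verified for your release.

```sql
-- List the system-generated XML indexes of the CUSTOMER table.
-- XRGN identifies the regions index; XPTH is assumed to identify
-- the path index of an XML column.
SELECT indname, indextype
FROM syscat.indexes
WHERE tabschema = CURRENT SCHEMA
  AND tabname   = 'CUSTOMER'
  AND indextype IN ('XRGN', 'XPTH');
```

Neither index is user-defined, so the query is useful mainly for confirming that they exist and for recognizing their names in monitoring output.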
3.3.2 Defining Columns, Tables, and Table Spaces for XML Data

In DB2 for Linux, UNIX, and Windows, database-managed table spaces (DMS) provide higher performance than system-managed table spaces (SMS) for relational data, and even more so for XML read and write access. Since DB2 9, newly created table spaces are DMS by default. It is also recommended to use DMS table spaces with automatic storage so that they grow as needed without manual intervention.

A key aspect of physical database design is the page size of a table space. Measurements have shown that the lower the number of regions (splits) per XML document the better the performance, especially for XML insert and full-document retrieval. If a document does not fit on a single page, the number of splits per document depends on the page size (4KB, 8KB, 16KB, or 32KB). The larger the page size of the table space the lower the number of regions per document. For example, let’s say a given document gets split into forty regions across forty 4KB pages. Then it might be possible to store the same document on only twenty 8KB pages, ten 16KB pages, or five 32KB pages. If the XML documents are significantly smaller than the selected page size, no space is wasted because multiple small documents can be stored on a single page.

The impact of the page size on the number of regions per document is illustrated in Figure 3.8. Since each region requires one regions index entry, a larger page size that allows for fewer regions per document also leads to a smaller regions index.

Figure 3.8 The number of regions per document depends on the page size (diagram comparing 4K, 8K, and 32K pages)
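The arithmetic behind the forty-page example can be reproduced with a deliberately rough model in which the number of regions is the stored document size divided by the page size, rounded up. This model is an assumption made for illustration: it ignores page overhead and the way DB2 actually chooses subtree boundaries, and the stored size of 160KB is a hypothetical value picked to match the example.

```sql
-- Rough model only: regions = ceiling(stored size / page size).
-- 160 KB is a hypothetical stored size that yields forty 4KB pages.
SELECT page_kb,
       CEILING(160.0 / page_kb) AS approx_regions
FROM (VALUES (4), (8), (16), (32)) AS t(page_kb);
```

Under these assumptions the query yields 40, 20, 10, and 5 regions for the four page sizes, which are the figures used in the example above. Real documents will deviate from this, so treat the result as an order-of-magnitude guide rather than a prediction.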
NOTE
Most XML applications perform best using 16KB or 32KB pages. 16KB pages can provide good performance if most documents are quite small (for example, less than 4KB) so that several documents fit on a page. Larger documents are better served by 32KB pages. If you prefer to use a single page size for XML and relational data, or for data and indexes, and you find that 32KB pages are too large for efficient access to relational data or indexes, then 16KB pages can be a good compromise.

Let’s look at some examples. Figure 3.9 shows how to define two table spaces, one with 4KB pages and one with 32KB pages. These table spaces are used in the subsequent CREATE TABLE statements and figures.

CREATE BUFFERPOOL bpsmall PAGESIZE 4K;
CREATE BUFFERPOOL bplarge PAGESIZE 32K;

CREATE TABLESPACE tbspace4k PAGESIZE 4K
  MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bpsmall;

CREATE TABLESPACE tbspace32k PAGESIZE 32K
  MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bplarge;

Figure 3.9 Creating table spaces with different page sizes

The CREATE TABLE statement shown in Figure 3.10 defines a table with an integer column and an XML column using the table space with 32KB pages. It places XML data and relational data into the same table space (see Figure 3.7). Consequently, they use the same page size and are buffered in the same buffer pool. This default layout provides good performance for most applications.

CREATE TABLE customer(id INTEGER, info XML) IN tbspace32k;

Figure 3.10 Creating a table with an XML column in a named table space

If you have done a performance analysis and find that you need a large page size for XML data but a small page size for relational data or indexes, you can use separate table spaces to achieve this. When you define a table, you can direct “long” data (LOB and XML data) into a separate table space with a different page size.
The corresponding table definition and storage objects are shown in Figure 3.11 and Figure 3.12, respectively. In this example, relational data is stored in a
table space tbspace4k with page size 4KB and XML data is stored in a table space tbspace32k with page size 32KB. If the table also contained a LOB column, the LOB data would be stored in a separate LOB object in the table space tbspace32k. Pages of the LOB object are not buffered in the buffer pool, whereas pages of the DAT, XDA, and INX objects are buffered.

CREATE TABLE customer(id INTEGER, info XML)
  IN tbspace4k
  LONG IN tbspace32k;

Figure 3.11 Storing XML and LOBs in a separate table space

Figure 3.12 Storage objects in separate table spaces (diagram: tbspace4k holds the DAT object and the INX object with the regions index; tbspace32k holds the XDA object)

If you had another table space named tbspace4kINX you could also direct the regions index as well as any user-defined indexes into their own table space. This layout is shown in Figure 3.13 and Figure 3.14.
    CREATE TABLE customer(id INTEGER, info XML)
      IN tbspace4k
      INDEX IN tbspace4kINX
      LONG IN tbspace32k;

Figure 3.13 Defining separate storage for indexes and XML data

[Diagram: table space tbspace4k contains the DAT object (rows with the ID values 1001, 1000, 1003, and 1005 and the INFO column); table space tbspace4kINX contains the INX object (regions index pages); table space tbspace32k contains the pages of the XDA object.]

Figure 3.14 Separate table spaces for relational data, XML, and indexes

In general, the fewer distinct page sizes and buffer pools you create, the easier it is to tune and maintain your database. Therefore we recommend that you use different page sizes for XML and relational data only if you have evidence that it improves the performance of your workload and if you need this performance gain to meet the performance requirements of your application. Otherwise there is benefit in keeping it simple. Dedicated measurements in a prototype and test workload can help you make such decisions.

Since DB2 9, new table spaces are by default large table spaces, in which the number of rows per page is no longer limited to 255. Hence, you don’t need to choose a small page size for relational data to ensure that pages are filled up and space isn’t wasted.
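To confirm where DB2 has placed the storage objects of a table, you can look the table up in the system catalog. The following query is a minimal sketch that assumes the customer table from Figure 3.13 and relies on the columns TBSPACE, INDEX_TBSPACE, and LONG_TBSPACE of the catalog view syscat.tables; check the column names against the catalog of your DB2 release.

```sql
-- Sketch: show which table spaces hold the relational rows (DAT object),
-- the indexes (INX object), and the long data (LOB and XML) of a table.
-- Assumes the customer table was created as in Figure 3.13.
SELECT tabname, tbspace, index_tbspace, long_tbspace
FROM syscat.tables
WHERE tabname = 'CUSTOMER';
```

For a table that was created without the INDEX IN and LONG IN clauses, as in Figure 3.10, the last two columns should be NULL because all objects share the table space named in TBSPACE.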
3.3.3 Dropping XML Columns

In DB2 9.1 and DB2 9.5 for Linux, UNIX, and Windows you cannot drop XML columns from a table. To remove an XML column, create a new table without the XML column and use a “load from cursor” to move data from the old table to the new table. Then drop the old table and rename the new table so that it assumes the name of the old table. Alternatively, you can export data from a table and then recreate and reload the table.

DB2 9.7 for Linux, UNIX, and Windows allows you to drop XML columns from a table with the ALTER TABLE statement. If a table contains multiple XML columns you can only drop all XML columns at the same time.

3.3.4 Improved XML Storage Format in DB2 9.7

DB2 9.7 uses a more optimized tree format for XML storage than prior releases. This improved format is completely transparent to all database operations such as queries, inserts, updates, indexing, and schema validation.

The improved XML format is used only in tables that are created in DB2 9.7 or higher. When you migrate a table with XML data from DB2 9 or 9.5 to DB2 9.7, this XML data remains in its previous format and is not changed. Documents that you newly insert or update in such a migrated table continue to be in the format of the previous DB2 release. The previous and the improved storage format are not mixed within the XDA object of a table.

The new storage format has the following benefits:

• It is more compact and can reduce the space consumption of your XML data.
• It allows compression of XML data in the XDA object (see section 3.5).
• It allows you to use the function ADMIN_EST_INLINE_LENGTH to estimate the inline length that would allow an XML document to be inlined (see section 3.4).
• It enables faster redistribution of XML data in a partitioned database; that is, you can use the NOT ROLLFORWARD RECOVERABLE option in the REDISTRIBUTE command to redistribute data in bulk and avoid logging.
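After a migration it can be useful to know which tables still use the previous storage format and therefore do not yet offer these benefits. One option is the column XML_RECORD_TYPE of the administrative view sysibmadm.admintabinfo, which is described in section 3.5. The query below is only a sketch; it assumes that the value 1 identifies tables whose XML columns were created prior to DB2 9.7, as stated in that section.

```sql
-- Sketch: list tables whose XML data is still in the pre-9.7 storage format.
-- Assumption: XML_RECORD_TYPE = 1 means "created prior to DB2 9.7",
-- 2 means "created in DB2 9.7", NULL means "no XML column".
-- DISTINCT collapses the multiple rows that the view can return per table.
SELECT DISTINCT tabschema, tabname, xml_record_type
FROM sysibmadm.admintabinfo
WHERE xml_record_type = 1;
```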
If you have migrated a table with XML data from DB2 9 or 9.5 to DB2 9.7 and want to bring the XML data into the new format, you need to create a new table and copy the data from the old to the new table. You can use “load from cursor” for moving data from one table to another efficiently. Then you can drop the old table and rename the new table to the old table name.

Starting with DB2 9.7, copying and renaming a table can be done more elegantly and with minimal downtime by using the procedure SYSPROC.ADMIN_MOVE_TABLE. This procedure performs an online table move, which means that table data is copied to a table object with the same name, but not necessarily the same columns and storage characteristics. When the copying is complete, the source table is briefly taken offline and its name is assigned to the new copy of the table. All indexes of the table are also copied. During the copy phase, any updates, inserts, or deletes on the source table are collected in a staging table and finally applied to the new table. An online table move with XML data requires that the table has at least one unique index and does not participate in foreign key constraints.

3.4 USING XML BASE TABLE ROW STORAGE (INLINING)

From DB2 9.5 for Linux, UNIX, and Windows onwards, XML documents that are small enough to fit on a single page can be stored on the same page as the relational row that they belong to. This capability is called base table row storage, or inlining. It means that the tree structure of an XML document is no longer stored on a separate XDA page, but next to the relational data inside the DAT object in the table space (Figure 3.16). XML inlining is currently not available in DB2 for z/OS.

Inlining needs to be explicitly enabled as a column option because it may or may not provide performance benefits. Before we discuss the performance trade-offs, Figure 3.15 shows how to create a table with inlined XML storage. You can add a column option INLINE LENGTH to the definition of an XML column. In this example, any XML document that can be stored within 30,000 bytes is inlined. Documents that require more than 30,000 bytes are stored in the regular way (on separate XDA pages). The inlining of some or all documents is handled by the DB2 engine and completely transparent to the application.

DB2’s decision about whether a given document is within the inline length is based on the size of the document in DB2’s internal tree format, after XML parsing. The decision is not based on the length of the textual (serialized) representation of the XML document. Inlined documents can be compressed, but the inlining decision is based on their space requirement prior to compression.
    CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 30000)
      IN tbspace32k;

Figure 3.15 Table definition with inlined XML storage

The maximum allowed value for the inline length depends on the page size of the table space. As a rule of thumb, the inline length has to be less than the page size minus the total length of the other columns in the table and the overhead for the page header, and so on. For example, the maximum possible inline length in the example in Figure 3.15, where the table also contains an integer column and uses 32KB pages, is 32667 bytes.

If an XML document is updated it might become larger or smaller as a result of the update, which affects inlining. The update may cause a previously inlined document to be moved from the DAT object to the XDA object, or vice versa.

Figure 3.16 illustrates the storage objects in the table space when XML inlining is used. Three of the four documents meet the inline length and are now stored as part of the relational rows on pages in the DAT object. They do not have regions index entries. The document that belongs to the second row (id = 1000) is too large to be inlined. It is stored in the XDA object and spans three pages, which are linked from the row in the DAT object via the regions index. Note that inlining makes the DAT object larger, with larger and fewer rows per page. The XDA object has become smaller and the regions index has fewer entries than without inlining.

[Diagram: table space tbspace32k contains the DAT object (rows with the ID values 1001, 1000, 1003, and 1005, three of them with inlined documents in the INFO column), the INX object (regions index pages), and the XDA object (the pages that hold the document of the row with id = 1000).]

Figure 3.16 Storage objects with XML inlining

The CREATE TABLE statement in Figure 3.17 creates the customer table in table space tbspace4k, allows documents up to 3500 bytes to be inlined, and automatically directs larger documents to the table space tbspace32k. In this case the inlining takes precedence over the LONG IN clause. If a document is small enough to be inlined it will be part of the base table row and stored on DAT pages in tbspace4k. Otherwise it is stored on XDA pages in tbspace32k.

    CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 3500)
      IN tbspace4k
      LONG IN tbspace32k;

Figure 3.17 Another table definition with inlined XML storage

The inline length of an XML column can be changed with an ALTER TABLE statement, as shown in Figure 3.18. This allows you to increase the inline length of an XML column, or to enable inlining for an XML column that wasn’t previously defined with inlining.
    ALTER TABLE customer ALTER COLUMN info SET INLINE LENGTH 3600;

Figure 3.18 Changing the inline length of an XML column

The ALTER TABLE operation does not affect existing documents in the table, only documents that are inserted, loaded, or updated after the ALTER TABLE statement has been issued. If you want existing documents to obey the newly set inline length, you need to update them with themselves, as shown in Figure 3.19. Be aware that a bulk update of many XML documents can require a lot of log space. You might have to perform a series of smaller updates and commit frequently to avoid running out of log space.

After you use an UPDATE statement to move XML data from the XDA object to the DAT object, you might want to reorganize the table to reclaim the freed-up space in the XDA object (see section 3.7). However, reorganization by itself does not move XML data from the XDA object to the DAT object.

    UPDATE customer SET info = info;

Figure 3.19 Updating existing documents to apply inlining

After you have specified an inline length for an XML column, you can only increase the inline length, not reduce it. The only way to “undo” the inlining of XML documents is to copy the documents into a new table without inlining, drop the old table, and rename the new table to the old table name. Starting with DB2 9.7 you can also do this copying with the procedure SYSPROC.ADMIN_MOVE_TABLE.

3.4.1 Monitoring and Configuring XML Inlining

After you have set the inline length for an XML column, any newly inserted or updated document is inlined if DB2’s internal tree representation of the document fits within the specified inline length. The size of an XML document in DB2’s internal tree format depends on the actual document characteristics, such as the length of element names, the length of element values, the presence of namespaces, and other factors. In particular, the space required to store a document in an XML column might be less than or greater than the size of the document in its textual representation. In DB2 9.5 and higher, the space requirement of most XML documents is between 70% and 150% of the space that they occupy in the file system.

Therefore predicting whether a particular document will or will not be inlined can be difficult. Similarly, choosing an inline length that allows inlined storage of all or most documents can also be difficult. To address this problem, DB2 9.7 for Linux, UNIX, and Windows introduced the scalar functions ADMIN_IS_INLINED and ADMIN_EST_INLINE_LENGTH.
The function ADMIN_IS_INLINED takes an XML column name as input, and returns

• 1 if the document in the current row of the XML column is inlined.
• 0 if the document in the current row of the XML column is not inlined.
• NULL if the XML column of the current row is NULL.

The query in Figure 3.20 shows how the function ADMIN_IS_INLINED can be used to examine a table with inlining, like the one defined previously in Figure 3.17. The query reveals for every document in the table whether or not it is inlined. The output indicates that the documents with the relational id values 1000 and 1002 are inlined while the other documents are not inlined.

    SELECT id, ADMIN_IS_INLINED(info) AS inlined
    FROM customer;

    ID               INLINED
    ---------------- ----------------
    1000             1
    1001             0
    1002             1
    1003             0
    1004             0
    1005             0

      6 record(s) selected.

Figure 3.20 Determining which documents are inlined

Since the query in Figure 3.20 can produce a lot of output when applied to a large table, you may want to add a WHERE clause to retrieve the inlining status only for a subset of documents.

Figure 3.21 uses the ADMIN_IS_INLINED function to compute the number of documents that are inlined as well as the number of those that are not. The subselect in Figure 3.21 uses the clause FETCH FIRST 1000 ROWS ONLY to obtain inlining information based on at most 1,000 documents. This can be useful if the input table is large and you want to use the first 1,000 documents as a representative sample rather than scanning the entire table. Alternatively, you could use the keywords TABLESAMPLE BERNOULLI(n) in the FROM clause of the subselect to sample n% of all rows in the table.
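The sampling alternative, applied to the query of Figure 3.21, could be written as follows. This is a sketch only: the sample rate of 10 percent is an arbitrary choice for illustration, the correlation name t is hypothetical, and the query assumes the customer table from Figure 3.17.

```sql
-- Sketch: count inlined and non-inlined documents in a random sample of
-- roughly 10% of the rows, instead of the first 1,000 rows.
-- BERNOULLI sampling decides row by row, so the counts vary between runs.
SELECT COUNT(*) AS doc_count,
       CASE WHEN inlined = 1 THEN 'Yes' ELSE 'No' END AS inlined
FROM (SELECT ADMIN_IS_INLINED(info) AS inlined
      FROM customer TABLESAMPLE BERNOULLI(10)) AS t
GROUP BY inlined;
```

Because the sample is random, it is less likely than FETCH FIRST to be biased by the physical order of the rows, but on a very small table it can return few or no rows.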
    SELECT COUNT(*) AS doc_count,
           CASE WHEN inlined = 1 THEN 'Yes' ELSE 'No' END AS inlined
    FROM (SELECT ADMIN_IS_INLINED(info) AS inlined
          FROM customer
          FETCH FIRST 1000 ROWS ONLY)
    GROUP BY inlined;

    DOC_COUNT        INLINED
    ---------------- ----------------
    2                Yes
    4                No

      2 record(s) selected.

Figure 3.21 Obtaining the number of inlined documents

The result in Figure 3.21 shows that only two out of six examined documents are inlined. This raises the question of how much you would need to increase the inline length so that most or all of the documents can be inlined. Similarly, you might have a table with an XML column for which inlining is not yet enabled. You might wonder which inline length to use so that most or all of the documents in that column get inlined. The function ADMIN_EST_INLINE_LENGTH is designed to answer these questions.

The function ADMIN_EST_INLINE_LENGTH takes an XML column name as input, and returns

• The lowest inline length (in bytes) that would allow the XML document in the current row to be inlined. This is an estimated value.
• -1, if the document in the current row of the XML column is too large to be inlined for the given page size.
• -2, if the required inline length cannot be estimated for the document in the current row. This is the case for any documents that have been inserted and stored prior to DB2 9.7 because DB2 9.7 uses a more optimized XML storage format (see section 3.3.4).
• NULL, if the XML column of the current row is NULL.

Figure 3.22 shows sample output of the function ADMIN_EST_INLINE_LENGTH. The values returned depend on the actual XML data in the table. In this example, the output shows that the first document (relational id = 1000) is already inlined and its actual size in DB2’s internal format is 770 bytes. The second document (id = 1001) is not inlined, but it can be inlined if the inline length is increased to 2345 or larger. The document with id = 1005 cannot be inlined because it is too large to fit on a single page together with the other columns in the table.
    SELECT id, ADMIN_IS_INLINED(info) AS inlined,
           ADMIN_EST_INLINE_LENGTH(info) AS inline_length
    FROM customer;

    ID               INLINED          INLINE_LENGTH
    ---------------- ---------------- ---------------
    1000             1                770
    1001             0                2345
    1002             1                796
    1003             0                1489
    1004             0                1910
    1005             0                -1

      6 record(s) selected.

Figure 3.22 Examining the required inline length for specific XML documents

For a proposed inline length, such as 1500 bytes, the query in Figure 3.23 tells you how many documents in the column would be inlined if this inline length were used.

    SELECT COUNT(*) AS doc_count
    FROM customer
    WHERE ADMIN_EST_INLINE_LENGTH(info) BETWEEN 0 AND 1500;

    DOC_COUNT
    ----------------
    3

      1 record(s) selected.

Figure 3.23 Estimating the effectiveness of a proposed inline length

Figure 3.24 gives an example of a more comprehensive report on the distribution of document sizes in a table. It shows that two documents require no more than 1000 bytes each, four documents can be stored in at most 2000 bytes each, five fit into 3000 bytes each, no potentially “inlinable” document is larger than 3000 bytes, and one document is too big to be inlined.
    SELECT SUM(a) AS "<= 1000", SUM(b) AS "<= 2000", SUM(c) AS "<= 3000",
           SUM(d) AS "> 3000", SUM(e) AS "too big"
    FROM (
      SELECT CASE WHEN len > 0 AND len <= 1000 THEN 1 END AS a,
             CASE WHEN len > 0 AND len <= 2000 THEN 1 END AS b,
             CASE WHEN len > 0 AND len <= 3000 THEN 1 END AS c,
             CASE WHEN len > 3000 THEN 1 END AS d,
             CASE WHEN len = -1 THEN 1 END AS e
      FROM (
        SELECT ADMIN_EST_INLINE_LENGTH(info) AS len
        FROM customer)
    );

    <= 1000     <= 2000     <= 3000     > 3000      too big
    ----------- ----------- ----------- ----------- -----------
    2           4           5           0           1

      1 record(s) selected.

Figure 3.24 Analyzing the distribution of document sizes

In Figure 3.24, if you replace ADMIN_EST_INLINE_LENGTH(info) with LENGTH(XMLSERIALIZE(info AS CLOB)) then you obtain information about the textual (serialized) size of the documents instead of their parsed size inside DB2’s storage area.

Since inlined XML documents reside on regular data pages instead of XDA pages, their read and write activity is reflected in the snapshot monitor counters for data pages, just like any other relational activity. Reads and writes to non-inlined XML documents affect XDA pages and are reported in separate snapshot monitor counters, such as Buffer pool XDA logical reads as opposed to Buffer pool data logical reads. This difference in page counters does not affect performance, only the way you monitor XML activity in the database.

3.4.2 Potential Benefits and Drawbacks of XML Inlining

Inlining of XML data in the base table has several consequences that you should be aware of. When in doubt, it is advisable to perform tests with your XML data and workload to determine whether inlining is beneficial in your environment.

Potential benefits of inlining include

• Since inlined documents are stored on the relational data pages and never span more than one such page, they do not require any regions index entries. If most or all of the documents are inlined, the regions index will be very small. This saves storage space and can improve performance.
• Since inlined XML documents reside on the data pages of the table, they participate in DB2’s prefetching. Prefetching can significantly improve the performance of queries that read many documents from a table, but it is of little or no benefit to queries that fetch only a single document.
• If you use DB2 9.5 or later and enable row compression, all inlined documents will be compressed. This is the only way to compress XML data in DB2 9.5. If your system tends to be I/O bound, compression can improve performance dramatically. Compression allows DB2 to use fewer I/O operations to read the same amount of data. Since compressed data pages remain compressed in the buffer pool, a larger number of rows (documents) are kept in your buffer pool.

Potential drawbacks of inlining are

• Since inlined documents are stored within the relational rows of the table, the row size is a lot larger than without inlining. As a result, the number of rows that are stored on a single page is much lower. It can be as low as one row per page if most of your inlined documents occupy the majority of the page they each reside on. Queries that read data from the non-XML columns of the table need to access a much larger number of pages than without inlining. This can be detrimental for performance.
• XML queries and updates on inlined documents can use more temporary space at execution time than if the documents were not inlined. If the buffer pool for the temporary table space is large enough then this does not necessarily incur additional physical I/O and the performance impact is low to moderate. It is highly recommended to use a dedicated buffer pool for the temporary table space.

3.5 COMPRESSING XML DATA

The ability to compress XML data in DB2 depends on the version of DB2 you are using. DB2 for z/OS compresses XML data if compression is enabled for the table space that contains the table. Compression can be enabled with an ALTER TABLESPACE statement and then takes effect after the first reorganization of the table space. Further details are provided in section 3.11.

DB2 for Linux, UNIX, and Windows supports XML compression as follows:

• DB2 9.1 does not support XML compression.
• DB2 9.5 allows you to compress all documents that are inlined.
• DB2 9.7 supports compression of inlined documents as well as those stored in the XDA object. Compression of the XDA object is only supported for tables and XML columns created in DB2 9.7, not for XML columns created in prior releases (see section 3.3.4).
• DB2 9.7 also supports compression of user-defined XML indexes. They are compressed automatically if the table itself is compressed, unless you alter them explicitly to disable compression. The XML regions index is never compressed.

Figure 3.25 shows a CREATE TABLE statement that inlines XML documents of up to 30,000 bytes in size and enables compression. After the table is populated with an initial amount of data (about 1 MB to 2 MB), a compression dictionary is automatically created and any subsequent data that is inserted, loaded, or updated is subject to compression. The rows that were in the table before the compression kicked in are not compressed until an offline reorganization of the table is performed.

    CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 30000)
      IN tbspace32k
      COMPRESS YES;

Figure 3.25 Table definition with inlined XML storage and compression

In DB2 9.5, XML documents that are too large for inlining are excluded from compression because only the DAT object is compressed, not the XDA object. In DB2 9.7, the table definition in Figure 3.25 compresses all documents, including those that are not inlined. DB2 9.7 compresses both the DAT and the XDA object and uses separate compression dictionaries for both.

Since DB2 9.5 you can use the administrative view sysibmadm.admintabcompressinfo to check how well the data in a table is being compressed, as shown in Figure 3.26.

    SELECT tabname, pages_saved_percent, bytes_saved_percent
    FROM sysibmadm.admintabcompressinfo
    WHERE tabname = 'CUSTOMER';

    TABNAME    PAGES_SAVED_PERCENT BYTES_SAVED_PERCENT
    ---------- ------------------- -------------------
    CUSTOMER   67                  67

      1 record(s) selected.

Figure 3.26 Checking how well a table is compressed in DB2 9.5

In DB2 9.7, where the DAT and XDA storage objects are compressed separately, the view sysibmadm.admintabcompressinfo has the additional column object_type. This column allows you to examine the compression ratio of the DAT and XDA objects separately (Figure 3.27).
    SELECT tabname, object_type, pages_saved_percent, bytes_saved_percent
    FROM sysibmadm.admintabcompressinfo
    WHERE tabname = 'CUSTOMER';

    TABNAME    OBJECT_TYPE PAGES_SAVED_PERCENT BYTES_SAVED_PERCENT
    ---------- ----------- ------------------- -------------------
    CUSTOMER   DATA        67                  67
    CUSTOMER   XML         66                  66

      2 record(s) selected.

Figure 3.27 Checking how well a table is compressed in DB2 9.7

Additional information is available in the view sysibmadm.admintabinfo (Figure 3.28). Its columns DICTIONARY_SIZE and XML_DICTIONARY_SIZE reveal the existence and size of the compression dictionaries in the DAT and the XDA object, respectively. The column XML_RECORD_TYPE has the value 2 if the table with the XML column was created in DB2 9.7 and allows XDA compression. It has the value 1 if the table was created prior to 9.7, and NULL if the table does not have an XML column.

    SELECT tabname, dictionary_size, xml_dictionary_size, xml_record_type
    FROM sysibmadm.admintabinfo
    WHERE tabname = 'CUSTOMER';

    TABNAME  DICTIONARY_SIZE   XML_DICTIONARY_SIZE  XML_RECORD_TYPE
    -------- ----------------- -------------------- ---------------
    CUSTOMER 7720              24048                2

      1 record(s) selected.

Figure 3.28 Checking dictionary sizes

Compression information for tables and indexes is also available through the following table functions, which can also estimate the gain from compressing a currently uncompressed table or index:

• sysproc.admin_get_tab_compress_info in DB2 9.5
• sysproc.admin_get_tab_compress_info_v97 in DB2 9.7
• sysproc.admin_get_index_compress_info in DB2 9.7

You can also query the DB2 system catalog to get information about the compression ratio of tables and indexes. This approach requires that you first use the RUNSTATS command to collect statistics about the table:

    RUNSTATS ON TABLE db2admin.customer AND INDEXES ALL
Note that the RUNSTATS command requires a two-part table name, which consists of a relational schema name (db2admin in this case) as well as a table name (customer). After that, the queries in Figure 3.29 retrieve compression information from the catalog. Note that the first query only reports on the compression savings in the DAT object and ignores the XDA object. The second query obtains compression information for all indexes on the customer table, and can only be run in DB2 9.7 or higher. In Chapter 13, Defining and Using XML Indexes, you will learn that an XML index consists of a logical and a physical index, and that the compression ratio is only reported for the physical index.

    SELECT tabname, avgrowcompressionratio, pctpagessaved
    FROM syscat.tables
    WHERE tabname = 'CUSTOMER';

    SELECT indname, tabname, indextype, pctpagessaved
    FROM syscat.indexes
    WHERE tabname = 'CUSTOMER';

Figure 3.29 Obtaining compression information from the system catalog

3.6 EXAMINING XML STORAGE SPACE CONSUMPTION

The space consumption of XML documents in a table space depends on a variety of factors, including the structural characteristics of the documents, the ratio of tags to data values, the presence and number of namespaces, and the ratio of elements to attributes.

In DB2 9 and 9.5 for Linux, UNIX, and Windows the space consumption also depends on whether the documents are validated against an XML Schema upon insert, load, or update. Validated documents can take more space than non-validated documents due to type annotations, depending on the nature of the XML Schema. This impact of validation on the space consumption does not exist in DB2 for z/OS and has been removed in DB2 9.7 for Linux, UNIX, and Windows, too. Document validation is covered in Chapter 17, Validating XML Documents against XML Schemas.
In general, it is difficult to predict the exact amount of storage space that a particular document, or a set of documents, will occupy in a DB2 table space. To get a reliable estimate you can insert a representative set of documents into a table and check the number of pages occupied in DB2. Clearly, if you use several hundred or thousand documents you get a better estimate than if you use just a handful of documents.

You have multiple ways to check the number of pages used. In DB2 for Linux, UNIX, and Windows, one option is to use the list tablespaces command:

    list tablespaces show detail
This command reports the number of free and used pages for each table space. Figure 3.30 shows sample output of the list tablespaces command for a table space named tbspace32k. You see that the page size is 32KB and 1248 pages are occupied. Multiplying these two figures reveals that the space consumption is 39MB. This is the total for all tables and indexes in the table space. You might wonder why the number of usable pages is lower than the total number of pages in the table space. That is because DB2 reserves some pages for free space management and other housekeeping information.

    Tablespace ID                = 5
    Name                         = TBSPACE32K
    Type                         = Database managed space
    Contents                     = All permanent data. Large table space.
    State                        = 0x0000
      Detailed explanation:
        Normal
    Total pages                  = 2048
    Useable pages                = 2016
    Used pages                   = 1248
    Free pages                   = 768
    High water mark (pages)      = 1312
    Page size (bytes)            = 32768
    Extent size (pages)          = 32
    Prefetch size (pages)        = 32
    Number of containers         = 1

Figure 3.30 Detailed table space information

If you have multiple tables in one table space and prefer to see space information for each individual table, you can use the following commands where <dbname> is the name of your database:

    update monitor switches using table on;
    get snapshot for tables on <dbname>;

The GET SNAPSHOT command produces information for all tables that have been recently accessed. Figure 3.31 shows the information provided for the table customer. The number of used pages is reported separately for data pages, index pages, and XDA pages.

    Table Schema        = DB2ADMIN
    Table Name          = CUSTOMER
    Table Type          = User
    Data Object Pages   = 27
    Index Object Pages  = 32
    Xda Object Pages    = 192
    ...

Figure 3.31 Information from a table snapshot
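If you prefer an SQL interface over the snapshot monitor, the administrative view sysibmadm.admintabinfo, which is also used in section 3.5, can give a comparable breakdown per table. The query below is a sketch under the assumption that the view in your DB2 release provides the physical size columns DATA_OBJECT_P_SIZE, INDEX_OBJECT_P_SIZE, and XML_OBJECT_P_SIZE, and that these columns report sizes in kilobytes rather than in pages. The column aliases are hypothetical.

```sql
-- Sketch: physical size of the DAT, INX, and XDA objects of one table.
-- Assumption: the *_P_SIZE columns exist in this release and are in KB.
SELECT tabname,
       data_object_p_size  AS dat_kb,
       index_object_p_size AS inx_kb,
       xml_object_p_size   AS xda_kb
FROM sysibmadm.admintabinfo
WHERE tabschema = 'DB2ADMIN'
  AND tabname = 'CUSTOMER';
```

Dividing a reported size by the page size of the table space in kilobytes (32 for tbspace32k) gives an approximate page count. Expect it to differ somewhat from the snapshot output, since physical sizes can include allocated pages that are not yet used.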
If you are familiar with managing relational data in DB2 you might know that after using the RUNSTATS utility on a table you can also obtain the number of pages for that table by querying the DB2 system catalog as shown in Figure 3.32.

    SELECT npages FROM syscat.tables WHERE tabname = 'CUSTOMER';

Figure 3.32 Obtaining the number of pages for a table

Beware that Figure 3.32 only reports the number of data pages and does not include XDA or index pages. Hence, this number can be misleading if a table includes XML columns. It is accurate only if all documents in the table are inlined and reside on data pages only, not on XDA pages.

NOTE
After you have determined how much space your XML documents occupy in DB2, you can compare that number to how much space the same documents consume in their textual format in the file system. In DB2 9.5 and higher you find that the space consumption for most XML data is between 70% and 150% of the space that they occupy in the file system. The exact ratio depends on the characteristics of your documents. For example, since DB2’s internal storage format replaces all XML tag names with four-byte integer values, documents with very long tag names lose more characters when being stored in DB2 than documents with very short tag names.

3.7 REORGANIZING XML DATA AND INDEXES

When you delete or update XML documents, free space remains where the documents were previously stored. This space can be reused when other documents are inserted or updated. Hence, there is little need to reorganize XML data if your workload is a somewhat balanced mix of insert, update, and delete operations. You may wish to reorganize a table with XML data if many update and delete operations have taken place and you want to reclaim space. Reorganization has no impact on the tree structure in which XML documents are stored.
Reorganization is performed in DB2 for z/OS using the REORG utility (see section 3.12) and in DB2 for Linux, UNIX, and Windows using the REORG command, which is discussed in this section.

The REORG command in DB2 for Linux, UNIX, and Windows supports three operating modes, two of which support reorganizing XML data:

• Offline REORG with no table access supports XML data.
• Offline REORG with read-only table access supports XML data.
• Online REORG, also known as in-place REORG, which allows full read and write access to the table. This reorganization mode is not supported for XML data.

The main effect of the REORG command on XML data is that the space left behind by deleted or updated documents is reclaimed if the LONGLOBDATA option of the REORG command is used. If the LONGLOBDATA option is omitted, only the DAT object and the indexes are reorganized while the XDA object is ignored.

The following command reorganizes the customer table including its XML data without allowing concurrent access to the table:

    REORG TABLE customer ALLOW NO ACCESS LONGLOBDATA

To reorganize the customer table and its XML data and permit read-only access to the table during the REORG operation, issue this command:

    REORG TABLE customer ALLOW READ ACCESS LONGLOBDATA

If you omit the ALLOW NO ACCESS or ALLOW READ ACCESS clause then ALLOW READ ACCESS is the default for non-partitioned tables while ALLOW NO ACCESS is the default for partitioned tables.

You can reorganize relational and XML indexes separately from the table. DB2 9 and 9.5 allow you to reorganize XML indexes while users continue to have read-only access to the table. Use the following command to reorganize indexes with read access:

    REORG INDEXES ALL FOR TABLE customer ALLOW READ ACCESS

DB2 9.7 supports online reorganization for XML indexes, which means that applications have read and write access to the table while indexes are being reorganized. Use the following command to reorganize indexes in online mode:

    REORG INDEXES ALL FOR TABLE customer ALLOW WRITE ACCESS

3.8 UNDERSTANDING XML SPACE MANAGEMENT: A COMPREHENSIVE EXAMPLE

This section walks you through a comprehensive example of XML space management in DB2 for Linux, UNIX, and Windows. Let’s look at the example in Figure 3.33, which examines the effects of XML inlining, compression, and reorganization on the storage objects of a table.
The example performs the following steps:
1. Create a table
2. Import/Load XML data into the table
3. Get a snapshot to examine the number of pages used
4. Alter the table to enable inlining
5. Update all documents to physically move them from the XDA into the DAT object
6. Get a snapshot again to verify the increase of DAT pages
7. Reorganize the table to free up the empty XDA pages that got left behind after inlining
8. Get a snapshot to verify the reduction of XDA pages
9. Enable compression and reorganize the table again
10. Get a snapshot to verify the reduction of DAT pages after compression

Figure 3.33 shows the output of these steps when performed in the DB2 Command Line Processor. We have truncated the output of the GET SNAPSHOT commands to only show the relevant portion. We have also added some comments [in italics and square brackets] as additional explanations.

    db2 => create table customer (id int, info XML) in TBSPACE32K;
    DB20000I The SQL command completed successfully.

    [import/load a batch of 20480 small documents]

    db2 => select count(*) as num from customer;

    num
    -----------
          20480

      1 record(s) selected.

    db2 => get snapshot for tables on sampxml;

    Table Schema        = DB2ADMIN
    Table Name          = CUSTOMER
    Table Type          = User
    Data Object Pages   = 60
    Index Object Pages  = 38
    Xda Object Pages    = 880      [All XML data resides in the XDA object]
    Rows Read           = 0
    Rows Written        = 20480

    db2 => alter table customer alter info set inline length 30000;
    DB20000I The SQL command completed successfully.

    [Altering the inline length does not affect existing documents.]
    [Issue an update statement to rewrite all documents:]

    db2 => update customer set info = info;
    DB20000I The SQL command completed successfully.

    db2 => get snapshot for tables on sampxml;

    Table Schema        = DB2ADMIN
    Table Name          = CUSTOMER
    Table Type          = User
    Data Object Pages   = 896      [XML docs have been inlined]
    Index Object Pages  = 38       [Regions index pages are now empty]
    Xda Object Pages    = 880      [XDA pages are now empty, which we prove with a reorg]
    Rows Read           = 61435
    Rows Written        = 40960

    db2 => reorg table customer LONGLOBDATA;
    DB20000I The REORG command completed successfully.

    db2 => get snapshot for tables on sampxml;

    Table Schema        = DB2ADMIN
    Table Name          = CUSTOMER
    Table Type          = User
    Data Object Pages   = 887
    Index Object Pages  = 5        [Regions index pages freed up]
    Xda Object Pages    = 1        [XDA pages have been freed up]
    Rows Read           = 122880
    Rows Written        = 0

    Table Reorg Information:
      Reorg Type = Reclaiming
                   Table Reorg
                   Allow Read Access
                   Reorg Long Field LOB Data

    [Enable compression:]

    db2 => alter table customer compress yes;
    DB20000I The SQL command completed successfully.

    [Then reorg to actually compress the existing data in the table:]

    db2 => reorg table customer resetdictionary;
    DB20000I The REORG command completed successfully.

    db2 => get snapshot for tables on sampxml;

    Table Schema        = DB2ADMIN
    Table Name          = CUSTOMER
    Table Type          = User
    Data Object Pages   = 106      [Compression ratio 8:1]
    Index Object Pages  = 5
    Xda Object Pages    = 1
    Rows Read           = 184320
    Rows Written        = 0

    Table Reorg Information:
      Reorg Type = Reclaiming      [The reorg was a reclaiming reorg.]
                   Table Reorg
                   Allow Read Access
                   Reset Compression Dictionary
                   Reorg Data Only
      Rows compressed = 20480

Figure 3.33 Examining the effects of XML inlining, compression, and reorganization

3.9 XML IN RANGE PARTITIONED TABLES AND MDC TABLES

Range partitioning (also known as table partitioning), as well as multidimensional clustering (MDC), are methods for the storage organization of a database table in DB2 for Linux, UNIX, and Windows. These methods can improve the performance and manageability of large tables. Starting with DB2 9.7, XML columns are allowed in range partitioned tables and MDC tables.

3.9.1 XML and Range Partitioning

Range partitioning allows you to horizontally partition a table based on the values in one or multiple columns. Each partition is stored in a separate storage object and can be placed in its own table space, which can significantly improve the manageability as well as the performance of large tables. For example, if a table has a column of type DATE, such as an order date or a booking date, you could choose to have all rows with dates in January in one partition, all rows for February in a second partition, and so on. Alternatively you could decide to have one partition per week. Although range partitioning by date is most common, you could also partition a table by product code, last name, zip code, or other information.

Range partitioning has various benefits. For example, if a table is partitioned by date you can attach (roll-in) a new partition with new data, and detach (roll-out) the oldest partition if that data is no longer required in the table. This rolling in and rolling out of data allows you to make new data available quickly. It also allows you to quickly remove old data without bulk delete operations that can be time consuming and require substantial logging overhead.
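The roll-in and roll-out operations described above can be sketched as follows. This is a minimal illustration rather than a statement from the original text: it assumes a table named orders that is range-partitioned by quarter on a DATE column and already has a partition named "1Q09", plus a hypothetical staging table orders_1q10 with the same column definitions that already holds the rows for the new quarter. The table name orders_1q09 is likewise made up for this sketch.

```sql
-- Roll-in: attach the staging table as a new partition of orders.
-- orders_1q10 is a hypothetical staging table with matching columns.
ALTER TABLE orders
  ATTACH PARTITION "1Q10"
  STARTING '1/1/2010' ENDING '3/31/2010'
  FROM orders_1q10;

-- Integrity processing makes the newly attached rows available.
SET INTEGRITY FOR orders IMMEDIATE CHECKED;

-- Roll-out: detach the oldest partition into a stand-alone table.
-- orders_1q09 is a hypothetical name for the resulting table.
ALTER TABLE orders
  DETACH PARTITION "1Q09" INTO orders_1q09;
```

Neither operation copies or deletes individual rows, which is why roll-in and roll-out avoid the logging cost of bulk inserts and deletes.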
After a partition has been detached it is still available as a separate table if required for processing. Additionally, if queries include predicates on the partitioning columns, the DB2 optimizer can intelligently exclude non-relevant partitions from being scanned. This optimization is called partition elimination and improves performance.

The partitioning key has to consist of one or multiple relational columns; it cannot be an element or an attribute in an XML column nor the whole XML column. The XML column is payload in the rows of a range partitioned table, which can be very useful if you are managing very large numbers of XML documents and need roll-in/roll-out capabilities for your XML data. For example, a tax processing system may store the filing date in a relational column and the tax return in an XML column. An order management system may store the order date in a relational column and the actual order as XML. Figure 3.34 shows a sample of a range-partitioned order table with one partition per quarter.

    CREATE TABLE orders(id INT, orderdate DATE, order XML)
      PARTITION BY RANGE(orderdate)
      (PARTITION "1Q09" STARTING '1/1/2009',
       PARTITION "2Q09" STARTING '4/1/2009',
       PARTITION "3Q09" STARTING '7/1/2009',
       PARTITION "4Q09" STARTING '10/1/2009' ENDING '12/31/2009');

Figure 3.34 A range-partitioned table with XML

Starting with DB2 9.7, indexes on range-partitioned tables can be defined as global or local indexes. Prior to DB2 9.7, indexes on range-partitioned tables are always global indexes. A global index consists of a single non-partitioned index structure for the entire table. A local index has as many partitions as the table and each index partition contains index entries only for its corresponding table partition. Compared to global indexes, this partitioning scheme can improve the manageability of indexes on partitioned tables when you need to roll data in or out. Partitioned indexes can be stored in the same table space as the data partitions or in a different one.

The internal regions index of a range-partitioned table is always a local (partitioned) index. If you attach a new table partition to a table that contains an XML column, DB2 automatically uses the regions index of the new table partition as a new regions index partition, without rebuilding the regions index. Similarly, when you detach a table partition, the corresponding regions index partition becomes the regions index of the detached table.

The internal XML path index is always a global (non-partitioned) index even if the table is range-partitioned. When you attach a new table partition to a table that contains an XML column, DB2 immediately maintains the path index for the target table.
This behavior is different from other non-partitioned indexes, which are maintained during SET INTEGRITY after the ATTACH operation. When the ATTACH operation is completed, the existing XML path index of the attached table partition is dropped, because it is superseded by the updated non-partitioned path index on the entire table. When you detach a partition of a table that contains an XML column, a new and separate path index is created for the detached partition, because the XML data in the detached table is not accessible without a path index.

3.9.2 XML and Multidimensional Clustering

Multidimensional clustering (MDC) provides a method for clustering data in tables along one or multiple dimensions. MDC tables can significantly improve query performance and reduce the overhead of data maintenance operations such as inserting and deleting. Similar to range partitioning, an MDC table can contain one or multiple XML columns as payload. However, the table cannot be clustered based on values in an XML column. The clustering key has to consist of one or multiple relational columns.
The statement in Figure 3.35 defines a table with sales data, including information about the date, store, and product of the sale, plus additional details in an XML column. The table is clustered by date, storeid, and productid, which means that rows reside in the same blocks of the table if they describe different sales of the same product in the same store and on the same date.

    CREATE TABLE sales(id INT, date DATE, storeid INT, productid INT, details XML)
      ORGANIZE BY DIMENSIONS(date, storeid, productid);

Figure 3.35 An MDC-table with XML

The way you use and manage range-partitioned tables or MDC tables is the same whether they include XML columns or not. Since there are no XML-specific considerations for these storage methods, we do not cover this topic in any more detail. The interested reader should refer to the DB2 documentation or general DB2 database administration books for further information on range-partitioning and multidimensional clustering.

3.10 XML IN A PARTITIONED DATABASE (DPF)

The Database Partitioning Feature (DPF) in DB2 for Linux, UNIX, and Windows allows you to create a database that consists of multiple partitions. Each database partition, also known as a node or database node, can reside on a physically separate server. It is also possible to have multiple database partitions on a single server. A database table can reside in a single partition or can be distributed across multiple database nodes. When a table is assigned to multiple partitions, its rows are distributed across the partitions by a hash function. This data distribution allows multiple processors and multiple machines to work in parallel to execute queries and other processing tasks. Extensive parallelization allows you to run complex queries over large amounts of data with shorter response times than in a single partition database.
This benefit is particularly important for data warehousing and complex analytical queries.

Starting with DB2 9.7, tables with XML columns can be created in a partitioned database to enable parallel processing of XML data. You can also add XML columns to existing partitioned tables in a DPF database. Designing, configuring, and managing a partitioned database is largely the same whether it includes XML data or not. Hence, this section focuses on the XML-specific considerations for partitioned databases.

To create a table in a partitioned database and distribute its data across a set of database partitions, you need to take the steps illustrated in Figure 3.36. If your table should be distributed across the database partitions 1 through 16, create a database partition group with the corresponding range of partition numbers. Then you create a buffer pool and a table space for this partition group. Subsequently, any table that you create in this table space is distributed across the underlying database partitions. More precisely, each row that is added to the table is stored on exactly one of the underlying database partitions. A single row never spans two or more partitions. Each row is assigned to one of the partitions by hashing on a distribution key. In Figure 3.36, the clause DISTRIBUTE BY HASH specifies that the rows of the customer table are distributed based on the values in the id column. The XML documents are distributed together with the rows that they belong to. Each XML document resides in its entirety on exactly one partition.

    -- create a database partition group
    CREATE DATABASE PARTITION GROUP group1
      ON DBPARTITIONNUMS (1 TO 16);

    -- create a buffer pool for the partition group
    CREATE BUFFERPOOL bp_group1
      DATABASE PARTITION GROUP group1;

    -- create a table space in the partition group
    CREATE TABLESPACE ts_group1
      IN DATABASE PARTITION GROUP group1
      MANAGED BY AUTOMATIC STORAGE
      BUFFERPOOL bp_group1
      NO FILE SYSTEM CACHING;

    -- create a table with a distribution key
    CREATE TABLE customer(id INTEGER, info XML)
      IN ts_group1
      DISTRIBUTE BY HASH (id);

Figure 3.36 Creating storage objects in a partitioned database

A distribution key consists of one or multiple relational columns and cannot contain a LOB or XML column. A table cannot be distributed by element or attribute values in an XML column. You can, however, extract element or attribute values from XML documents, store them in a separate non-XML column, and use that column as the distribution key. Any unique key or primary key of the table must contain all the distribution key columns. Since XML values are not allowed in a distribution key, you cannot define unique XML indexes on a table that is distributed across database partitions.

3.11 XML STORAGE IN DB2 FOR Z/OS

Many of the concepts that are explained in the previous sections for DB2 on Linux, UNIX, and Windows also apply to DB2 9 for z/OS.
For example, the representation of XML documents in a parsed tree format, replacing tag names with unique integer values, breaking large document trees into regions, as well as the use of the XML data type to define columns in a table—all these concepts are used in DB2 for z/OS as well. There are, however, differences in how these concepts are implemented to fit well with table spaces and other infrastructure in DB2 for z/OS. These differences are explained in this section.
3.11.1 Storage Objects for XML Data

In DB2 9 for z/OS you can define XML columns in a table as easily as in DB2 for Linux, UNIX, and Windows:

    CREATE TABLE customer (id INTEGER, info XML)

Additionally, XML indexes can be defined on the XML column. The customer table appears to applications as shown in Figure 3.37. An application sees exactly the columns that you defined, including the XML column. Applications do not see or need to know that the physical storage of this table’s data is different from its logical appearance.

[Figure 3.37 Logical view of a table with XML column and XML index: the user table has the columns id (int) and info (XML), with a user-defined XML index (B+tree)]

When a user table contains an XML column, an additional hidden column called DB2_GENERATED_DOCID_FOR_XML of type BIGINT is automatically generated in addition to the XML column. This column holds a unique identifier for the XML columns in a row. There is a single DB2_GENERATED_DOCID_FOR_XML column even if the table contains multiple XML columns. An index called the document ID index (or DocID index) is automatically created on this column. The column DB2_GENERATED_DOCID_FOR_XML is not included in the result set of a “SELECT *” statement and should be considered as a DB2 internal column. The following explains how this column is used by DB2 internally.

Similarly to DB2 for Linux, UNIX, and Windows, XML documents are not stored directly in the XML column that you define in your user table. Instead, a separate internal XML table in its own table space is created for each XML column in the user table (base table). The internal XML table in DB2 for z/OS serves a similar purpose as the XDA object in DB2 for Linux, UNIX, and Windows, that is, XML documents are stored outside of the rows of the user table so that documents up to 2GB in size can be managed. A single document can be physically split across multiple rows in the XML table, but logically it belongs to a single row in the base table.
These concepts are illustrated in Figure 3.38.
The internal XML table consists of three columns: DOCID (BIGINT), MIN_NODEID (VARCHAR), and XMLDATA (VARBINARY). The DOCID column is used internally to automatically join the XML table with the DB2_GENERATED_DOCID_FOR_XML column in the base table. The XMLDATA column contains regions of XML documents in the pureXML parsed tree format, not in textual format. Therefore the column type is VARBINARY. In any given row, the MIN_NODEID column provides the lowest nodeID value of all nodes in the region that is stored in the XMLDATA column of the same row. This helps DB2 to process documents efficiently, even if they span multiple regions (multiple rows) in the internal XML table. The internal table is clustered by DOCID and MIN_NODEID. Hence the regions of a large document are physically always stored in their logical consecutive order.

A so-called NodeID index is defined on the internal XML table. Its key consists of the DOCID value plus the largest nodeID of the region in the current row. The NodeID index allows for efficient access from a base table row to the corresponding document regions in the internal XML table. These regions comprise the document that logically belongs to the base table row.

When you define an XML index on the XML column of your table, DB2 creates a physical B-tree index on the XMLDATA column in the internal XML table. The XML table is where the actual XML data resides that is being indexed. When DB2 uses your XML index to evaluate an XML predicate, it retrieves DOCID values from the internal XML table to join back to the base table. For this reason, the base table has a DocID index on its DB2_GENERATED_DOCID_FOR_XML column (see Figure 3.38).
[Figure 3.38 Base table and XML table column relationships: the user table (DB2_GENERATED_DOCID_FOR_XML, id int, info XML) with the internal DocID index (B+tree); the internal XML table (DocID, min_NodeID, XMLData) with the internal NodeID index (B+tree) and the user-defined XML index (B+tree)]

Note that the XML column in the base table has the name (info) that was specified in the CREATE TABLE statement. To speed up the search for specific XML elements, you can define XML indexes on this info column, as shown in Figure 3.37. Internally, this results in a B-tree index over the XMLDATA column in the internal XML table. XML indexes are discussed in further detail in Chapter 13, Defining and Using XML Indexes.
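As an illustration of a user-defined XML index on the info column, a statement along the following lines could be used. The index name and the XML pattern are assumptions made for this sketch: the path /customerinfo/name presumes that each document has a customerinfo root element with a name child, which the text above does not specify.

```sql
-- Hypothetical XML index on the name element of each customer document.
-- The index name and the XMLPATTERN path are illustrative assumptions.
CREATE INDEX custname_idx ON customer(info)
  GENERATE KEY USING XMLPATTERN '/customerinfo/name'
  AS SQL VARCHAR(50);
```

Although the statement names the info column of the base table, the index that DB2 builds is physically defined over the XMLDATA column of the internal XML table, as described above.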
3.11.2 Characteristics of XML Table Spaces

The page size of an internal XML table space is always 16KB, regardless of the page size of the base table space. The internal XML table space therefore uses the buffer pool BP16K0 by default. There is a new DSNZPARM subsystem parameter called TBSBPXML, which you can set to specify the default buffer pool to use for XML table spaces if you do not want to use BP16K0.

The storage structure and partitioning scheme of an internal XML table space depends on the storage structure of the table space for the base table. If the base table resides in a simple, segmented, or partition-by-growth (PBG) table space, then the XML table space is automatically a partition-by-growth table space. An internal XML table space is never a simple or segmented table space. If the base table space is partitioned or range-partitioned, then the internal XML table space is range-partitioned. For example, if the base table space is partitioned into two parts, then the base table and the internal XML table also consist of two parts each. The DocID index and the NodeID index are always created as non-partitioned indexes (NPIs).

Table 3.1 summarizes the relationship between the storage structure of a base table space and the storage structure of an internal XML table space that is implicitly created for an XML column in the base table.

Table 3.1 Table Space Types for a Base Table and Internal XML Table

    Base Table Space       Internal XML Table Space
    Simple                 Partition-by-growth
    Segmented              Partition-by-growth
    Partition-by-growth    Partition-by-growth
    Partitioned            Range-partitioned
    Range-partitioned      Range-partitioned

The internal XML table and table space also inherit certain attributes from the base table and table space, such as TRACKMOD, ERASE, and LOCKMAX.
In particular, an internal XML table space inherits the COMPRESS YES parameter from the base table space, which allows compression of XML data along with the other data in the same table. The COMPRESS attribute for an internal XML table space can be altered by the ALTER TABLESPACE statement.
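For example, compression could be switched on for a single internal XML table space as sketched below. The database and table space names (DSN00191.XCUS0000) are placeholders for a generated XML table space name; substitute the names that DB2 generated for your own table.

```sql
-- Enable compression for one internal XML table space.
-- DSN00191.XCUS0000 is a placeholder for the generated name.
ALTER TABLESPACE DSN00191.XCUS0000 COMPRESS YES;
```

As with other table spaces in DB2 for z/OS, existing data is typically compressed only once a compression dictionary has been built, for example by a subsequent REORG of that table space.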
3.11.3 Tables with Multiple XML Columns

Let’s look at an example of a base table with two XML columns. Assume that the following ALTER TABLE statement is used to add a second XML column (hist) to our customer table:

    ALTER TABLE customer ADD COLUMN hist XML

The resulting storage objects are shown in Figure 3.39. The two XML columns are represented by two internal XML tables, each consisting of three columns and with the appropriate indexes, as described earlier. Each of the two XML tables resides in its own table space. The base table has a single DB2_GENERATED_DOCID_FOR_XML column, which allows DB2 to join back from both internal XML tables to the base table.

[Figure 3.39 Storage objects for a non-range-partitioned base table with two XML columns: the user table (DB2_GENERATED_DOCID_FOR_XML, id int, info XML, hist XML) with the internal DocID index (B+tree); the internal XML table for info (DocID, min_NodeID, XMLData) and the internal XML table for hist (DocID, min_NodeID, XMLData), each with two B+tree indexes]

3.11.4 Naming and Storage Conventions

When you create a table, you can specify a number of options, such as the database name, table space name, and storage group. Depending on which of these values are provided, the implicitly created internal XML table and indexes inherit certain attributes and/or use generated values.
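One way to look up the generated names is to query the DB2 catalog. The sketch below is an assumption-laden illustration, not part of the original text: it relies on the DB2 9 for z/OS catalog table SYSIBM.SYSXMLRELS, which records the internal XML table for each XML column, and joins it to SYSIBM.SYSTABLES to find the database and table space of that XML table. Verify the column names against the catalog documentation for your release before relying on it.

```sql
-- List each XML column of CUSTOMER together with its internal XML table
-- and the database and table space that hold that XML table.
-- Assumes the catalog tables SYSIBM.SYSXMLRELS and SYSIBM.SYSTABLES.
SELECT r.COLNAME,
       r.XMLTBOWNER,
       r.XMLTBNAME,
       t.DBNAME,
       t.TSNAME
FROM   SYSIBM.SYSXMLRELS r
JOIN   SYSIBM.SYSTABLES  t
  ON   t.CREATOR = r.XMLTBOWNER
 AND   t.NAME    = r.XMLTBNAME
WHERE  r.TBNAME  = 'CUSTOMER';
```

If several schemas contain a table named CUSTOMER, adding a predicate on r.TBOWNER narrows the result to one of them.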
Table 3.2 shows how certain attributes of the base table, XML table, and XML indexes are determined for a simple CREATE TABLE statement. Table 3.3 shows the corresponding information for a table definition with an explicit database and table space name.

Table 3.2 Explicit and Generated Attributes for XML Objects

    CREATE TABLE customer (id INTEGER, info XML)

    Object                      Name                    Database            Table Space  Index Space  Storage Group
    Base Table                  customer                generated           generated    -            SYSDEFLT
    XML Table                   generated               same as base table  generated    -            SYSDEFLT
    DocID Index (base table)    generated               -                   -            generated    SYSDEFLT
    NodeID Index (XML table)    generated               -                   -            generated    SYSDEFLT
    User-defined XML Indexes    from create index stmt  -                   -            generated    from create index stmt, or SYSDEFLT

Table 3.3 Explicit and Generated Attributes for XML Objects

    CREATE TABLE customer (id INTEGER, info XML) IN mydb.myts

    Object                      Name                    Database  Table Space  Index Space  Storage Group
    Base Table                  customer                mydb      myts         -            from database
    XML Table                   generated               mydb      generated    -            from database
    DocID Index (base table)    generated               -         -            generated    from database
    NodeID Index (XML table)    generated               -         -            generated    from database
    User-defined XML Indexes    from create index stmt  -         -            generated    from create index stmt, or derived from the database

3.12 UTILITIES FOR XML OBJECTS IN DB2 FOR Z/OS

The XML storage objects discussed in the previous section are supported by all relevant utilities in DB2 for z/OS. Table 3.4 provides an overview of the utility support for XML objects.
Table 3.4 Utility Support for XML Objects

    Utility                              Description
    CHECK DATA                           Checks the relationships between a base table with XML columns and the internal XML tables. The utility reports an error if it detects any inconsistencies. Optionally the utility can also set the status of an XML column to invalid if inconsistencies are found.
    CHECK INDEX                          Checks the internal DocID and NodeID indexes as well as any user-defined XML indexes on an XML column.
    COPY TABLESPACE                      Allows you to produce full or incremental image copies of a table space that contains a base table with an XML column. DB2 does not automatically copy the related XML table space with the internal XML table or any XML indexes. You have to specify the XML table space explicitly in the COPY TABLESPACE command. You also need to specify the index space or index names of any indexes that you want to be copied. The options SHRLEVEL REFERENCE, SHRLEVEL CHANGE, and CONCURRENT are supported for copying XML table spaces and indexes.
    COPY INDEX                           Supports taking full image copies and concurrent copies of the DocID and NodeID indexes as well as any user-defined XML indexes.
    COPYTOCOPY                           Allows you to copy existing copies of XML objects, such as XML table spaces, DocID and NodeID indexes, and any user-defined XML indexes.
    LISTDEF                              When you create object lists, you can select whether you want to include XML table spaces and indexes. By default, such XML objects are included. The new keyword XML allows you to list XML objects only.
    LOAD                                 The LOAD utility supports loading XML data into XML columns. Further details are provided in Chapter 5, Moving XML Data.
    MERGECOPY                            The MERGECOPY utility merges multiple incremental copies of a table space into a single incremental copy. It can also merge incremental copies with a full copy into a single new full copy. XML table spaces are fully supported.
    QUIESCE TABLESPACESET                When you use QUIESCE TABLESPACESET to quiesce a table space that contains an XML column, any related internal XML table spaces and XML index spaces are automatically included in the set of quiesced objects.
    REAL TIME STATISTICS                 Collects statistics on XML objects.
    REBUILD INDEX                        Supports rebuilding DocID and NodeID indexes as well as any user-defined XML indexes. The option SHRLEVEL CHANGE is not allowed for XML indexes.
    RECOVER TABLESPACE, RECOVER INDEX    These utilities support recovery of the XML table spaces and XML indexes. If you perform point-in-time recovery, you should recover all related XML objects (table spaces and indexes) to the same point in time.
    REORG TABLESPACE                     Allows you to reorganize a base table space as well as all related XML table spaces. The internal XML table spaces are not included automatically but must be explicitly specified. Options such as SHRLEVEL CHANGE are supported.
    REORG INDEX                          Supports reorganization of XML indexes.
    REPAIR                               Supports XML table spaces and XML indexes.
    REPORT TABLESPACESET                 All XML objects, such as XML table spaces and indexes, are included in the set of reported objects.
    RUNSTATS TABLESPACE, RUNSTATS INDEX  The RUNSTATS utility gathers statistics for the base table space, XML table spaces, DocID and NodeID indexes, and any user-defined XML indexes.
    UNLOAD                               Allows you to unload tables with XML columns. You only have to specify the base table space and do not have to specify the internal XML table space explicitly. You cannot unload XML data from a copy. To ensure portability of unloaded XML data, you should specify the UNICODE keyword and use Unicode delimiter characters. The UNLOAD utility adds an XML declaration with an encoding attribute to each XML document unless you unload the data in UTF-8 CCSID 1208. For further detail on XML encodings, see Chapter 20, Understanding XML Data Encoding.

3.12.1 REPORT TABLESPACESET for XML

You can identify the relationship between base tables and their related XML tables and XML indexes by running the REPORT TABLESPACESET command. Figure 3.40 shows the output of this command for the following table:

    CREATE TABLE customer(id INT, info XML, info2 XML)

    REPORT TABLESPACESET DSN00191.CUSTOMER

    TABLESPACE SET REPORT:

    TABLESPACE             : DSN00191.CUSTOMER
    TABLE                  : USER011.CUSTOMER
    INDEXSPACE             : DSN00191.IRDOCIDC
    INDEX                  : USER011.I_DOCIDCUSTOMER

    XML TABLESPACE SET REPORT:

    TABLESPACE             : DSN00191.CUSTOMER
    BASE TABLE             : USER011.CUSTOMER
    COLUMN                 : INFO
    XML TABLESPACE         : DSN00191.XCUS0000
    XML TABLE              : USER011.XCUSTOMER
    XML NODEID INDEXSPACE  : DSN00191.IRNODEID
    XML NODEID INDEX       : USER011.I_NODEIDXCUSTOMER
    COLUMN                 : INFO2
    XML TABLESPACE         : DSN00191.XCUS0001
    XML TABLE              : USER011.XCUSTOMER000
    XML NODEID INDEXSPACE  : DSN00191.IRNO1UKC
    XML NODEID INDEX       : USER011.I_NODEIDXCUSTOMER000

Figure 3.40 Running the REPORT TABLESPACESET command

3.12.2 Reorganizing XML Data in DB2 for z/OS

You can use the DB2 for z/OS REORG utility to reorganize table spaces with XML data. You have the option of reorganizing just the base table space, just the internal XML table spaces, or both. You must explicitly list the table spaces that you want to reorganize. If you are reorganizing XML table spaces, then you must specify the keyword WORKDDN and provide the specified temporary work file (the default work file is SYSUT1).

An example is shown in Figure 3.41, where the REORG statement specifies that DB2 is to reorganize the base table space CUSTOMER and XML table spaces XCUS0000 and XCUS0001. The names of XML table spaces are known from running the REPORT TABLESPACESET command. The command options in this reorganization tell DB2 to take an inline copy of the base table space and gather statistics for all table spaces. Note that the following options are not allowed in the REORG statement for XML table spaces and base table spaces with XML columns: DISCARD, REBALANCE, or UNLOAD EXTERNAL.
    //REORG EXEC DSNUPROC,UID='KUMARP2.REORG',TIME=1440,
    // UTPROC='',
    // SYSTEM='ISC9',DB2LEV=DB2A
    //SYSREC DD DSN=KUMARP2.REORG.SYSREC,
    // DISP=(MOD,DELETE,CATLG),
    // UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
    //SYSCOPY1 DD DSN=KUMARP2.REORG.SYSCOPY1,
    // DISP=(MOD,CATLG,CATLG),UNIT=SYSDA,
    // SPACE=(4000,(20,20),,,ROUND)
    //SYSUT1 DD DSN=KUMARP2.REORG.SYSUT1,
    // DISP=(MOD,DELETE,CATLG),
    // UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
    //SYSUT2 DD DSN=KUMARP2.REORG.SYSUT2,
    // DISP=(MOD,DELETE,CATLG),
    // UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
    //SYSIN DD *
      REORG TABLESPACE DSN00191.CUSTOMER
        COPYDDN(SYSCOPY1)
        STATISTICS TABLE(ALL) INDEX(ALL)
      REORG TABLESPACE DSN00191.XCUS0000
        STATISTICS TABLE(ALL) INDEX(ALL)
        WORKDDN(SYSUT1)
      REORG TABLESPACE DSN00191.XCUS0001
        STATISTICS TABLE(ALL) INDEX(ALL)
        WORKDDN(SYSUT2)
    /*

Figure 3.41 Reorganizing XML tables in DB2 for z/OS

You can also use the LISTDEF utility to group the related table spaces together into a list and then specify that list in the REORG statement.

3.12.3 CHECK DATA for XML

You can use the CHECK DATA utility to check the consistency between a base table with XML columns and the related XML tables. If the base table space is not consistent with any related XML table spaces, CHECK DATA reports an error. The default behavior of CHECK DATA is to check both the base table and the XML table spaces. However, you can add the keywords SCOPE REFONLY to check base tables only, or the keywords SCOPE AUXONLY to check only LOB and XML objects. In its simplest form, as shown in Figure 3.42, CHECK DATA checks the XML relationships, the LOB relationships, and the base table space.

    CHECK DATA TABLESPACE DSN00191.CUSTOMER

    DSNUGUTC - CHECK DATA TABLESPACE DSN00191.CUSTOMER XMLERROR REPORT
    .02 DSNUKINP - TABLESPACE 'DSN00191.CUSTOMER' IS NOT CHECK PENDING

Figure 3.42 Simple example of the CHECK DATA utility

Note that the table space name is the name of the base table space, not the name of the internal XML table space. If the table space is not in check pending state, then the XML part of the table space is okay.

Optionally you can also ask the CHECK DATA utility to invalidate any XML or LOB columns that it finds to be inconsistent. The appropriate keywords are shown in Table 3.5.
Table 3.5 CHECK DATA Error Keywords

Column in Error     Action Taken by CHECK DATA                                          Keyword
XML column          Report the error only                                               XMLERROR REPORT
XML column          Report the error and set the column in error to an invalid status   XMLERROR INVALIDATE
LOB column          Report the error only                                               LOBERROR REPORT
LOB column          Report the error and set the column in error to an invalid status   LOBERROR INVALIDATE
XML or LOB column   Report the error only                                               AUXERROR REPORT
XML or LOB column   Report the error and set the column in error to an invalid status   AUXERROR INVALIDATE

The keywords XMLERROR REPORT imply that an XML column check error is reported with a warning message, and the base table space is set to the auxiliary check-pending status (ACHKP). The keywords XMLERROR INVALIDATE imply that an XML column check error is reported along with a warning message, and the base table XML column is set to an invalid status. If an XML column had an invalid status and is now correct, it is set to valid.

As an example, if you want to limit the scope of the utility to just checking XML and LOB columns and reporting errors, you can use the job shown in Figure 3.43.
    //CHECK   EXEC DSNUPROC,PARM='ISC9,CHCKUT',COND=(4,LT)
    //SORTOUT DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
    //SYSUT1  DD UNIT=SYSDA,SPACE=(4000,(50,50),,,ROUND)
    //SYSERR  DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
    //SYSIN   DD *
      CHECK DATA TABLESPACE DSN00191.CUSTOMER
        SCOPE AUXONLY XMLERROR REPORT
    /*

    56.17 DSNUGUTC - OUTPUT START FOR UTILITY, UTILID = IANTEX
    56.20 DSNUGTIS - PROCESSING SYSIN AS EBCDIC
    56.22 DSNUGUTC - CHECK DATA TABLESPACE DSN00180.CUSTOMER SCOPE AUXONLY XMLERROR REPORT
    56.34 DSNUKDST - CHECKING TABLE KUMARP2.CUSTOMER
    56.46 DSNUGSOR - SORT PHASE STATISTICS
          NUMBER OF RECORDS=3
          ELAPSED TIME=00:00:00
    56.47 DSNUKDAT - CHECK TABLE KUMARP2.CUSTOMER COMPLETE, ELAPSED TIME=00:00:00
    56.50 DSNUK001 - CHECK DATA COMPLETE,ELAPSED TIME=00:00:00
    56.51 DSNUGBAC - UTILITY EXECUTION COMPLETE, HIGHEST RETURN CODE=0

Figure 3.43 Running the CHECK DATA utility
3.13 XML PARSING AND MEMORY CONSUMPTION IN DB2 FOR Z/OS

DB2 for z/OS has two parameters that allow you to limit the amount of memory used for XML operations. DB2 for z/OS also has unique capabilities to offload XML parsing to System z Application Assist Processors (zAAP) and System z Integrated Information Processors (zIIP).

3.13.1 Controlling the Memory Consumption of XML Operations

Unlike DB2 for Linux, UNIX, and Windows, DB2 for z/OS allows you to limit the amount of DB2 memory that is used for XML processing. For this purpose there are two new DSNZPARM subsystem parameters: XMLVALA and XMLVALS (in the macro DSN6SYSP). Since XML columns are defined without a maximum size per row, DB2 cannot estimate the amount of memory that it needs for processing SQL/XML and XPath queries before execution. If queries construct very large XML documents, the amount of memory that DB2 requires can grow very large. You can set the XMLVALA and XMLVALS subsystem parameters to limit memory consumption.

• XMLVALA specifies the upper limit for the amount of memory (in KB) that each user (thread) can use for processing XML data. The default is 204800 KB (200MB); the maximum value is 2GB. The recommended value for this parameter is at least four times the largest expected XML document size. The default value of 200MB is sufficient for most applications.
• XMLVALS specifies the upper limit for the amount of memory (in MB) that the entire subsystem can use for processing XML data. The default is 10240 MB (10GB); the maximum value is 50GB. The recommended value is the maximum number of concurrent threads multiplied by the value of XMLVALA.

When the system exceeds the maximum memory allowed per user or per system, the violating SQL statement fails with SQLCODE -904.
To track the memory usage for XML values and prevent the SQLCODE -904 from happening, the peak memory usage for XML has been added to DB2 statistic record IFCID 2, DB2 accounting record IFCID 3, and the DB2 monitor trace record IFCID 148. The DB2 statistic record provides the per-system peak memory usage for XML. The DB2 accounting record and DB2 monitor trace records provide the per-user peak memory usage for XML. The DSNZPARM subsystem parameters LOBVALA and LOBVALS are applicable to XML processing if you move very large XML documents from your application to the DB2 server or in the opposite direction (bind in and bind out). The general recommendation is to set LOBVALA to the largest expected XML document size, and to set LOBVALS to the maximum number of concurrent threads multiplied by the value of LOBVALA.
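The sizing rules of thumb for XMLVALA and XMLVALS can be worked through with a little arithmetic. The following Java sketch is only an illustration: the class name, the method names, and the sample workload (a 50 MB largest document and 40 concurrent threads) are assumptions made for this example, not values or tools supplied by DB2.

```java
/** Hypothetical helper that applies the XMLVALA and XMLVALS rules of thumb. */
final class XmlMemorySizing {

    /** XMLVALA is given in KB: at least four times the largest expected document. */
    static long recommendedXmlvalaKb(long largestDocumentKb) {
        return 4L * largestDocumentKb;
    }

    /** XMLVALS is given in MB: concurrent threads multiplied by XMLVALA. */
    static long recommendedXmlvalsMb(int concurrentThreads, long xmlvalaKb) {
        long totalKb = concurrentThreads * xmlvalaKb;
        return (totalKb + 1023) / 1024; // round up to whole megabytes
    }

    public static void main(String[] args) {
        long largestDocumentKb = 50L * 1024; // assumed: largest document is 50 MB
        int concurrentThreads = 40;          // assumed: peak number of threads

        long xmlvalaKb = recommendedXmlvalaKb(largestDocumentKb);
        long xmlvalsMb = recommendedXmlvalsMb(concurrentThreads, xmlvalaKb);

        System.out.println("XMLVALA (KB): " + xmlvalaKb);
        System.out.println("XMLVALS (MB): " + xmlvalsMb);
    }
}
```

With these assumed inputs the per-thread value equals the 200MB default, and the subsystem value stays below the 10GB default. The same multiplication by the number of concurrent threads applies when deriving LOBVALS from LOBVALA.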
3.13.2 Redirecting XML Parsing to zIIP and zAAP

XML insert, update, and load operations in DB2 for z/OS require XML parsing, which is performed by the z/OS XML System Services (XMLSS). XML System Services provides a system-level XML parser that is integrated in the base z/OS operating system. It can be used by system components, middleware, and applications that need efficient XML parsing services. For further information on XMLSS, see http://www.ibm.com/servers/eserver/zseries/zos/xml/.

XML parsing can be offloaded to zAAP and zIIP as follows:

• XML parsing can be redirected to zIIP and zAAP processors in z/OS V1.10.
• XML parsing can be redirected to zAAP processors in z/OS V1.9.
• With z/OS APAR OA20308, XML parsing is eligible for zAAP also in z/OS V1.8 and V1.7.
• With z/OS APAR OA23828, XML parsing is eligible for zIIP also in z/OS V1.9 and V1.8.

When you insert, update, or load XML data in DB2 for z/OS, only the XML parsing portion of the processing is eligible for offloading. Depending on the size and complexity of the XML documents, between 10% and 50% of the total CPU time is spent on XML parsing and eligible for offloading. Larger documents lead to a larger percentage of CPU consumption in the XML parser. For further details, see http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101227.

Several important fixes and functional enhancements for pureXML in DB2 9 for z/OS have been delivered via APARs. Table 3.6 lists some of the recommended APARs. For current information on the latest APARs, please look at APAR II14426 and visit http://www.ibm.com/software/data/db2/support/db2zos/.
Table 3.6 Relevant APARs for XML processing in DB2 9 for z/OS

APAR                                          Description
PK47594                                       XML load performance improvement
PK51571, PK51572, PK51573, PK58914, PK57409   XMLTABLE support (see Chapter 7)
PK57158                                       XML index access path improvement
PK50575                                       zAAP accounting
PK55585, PK55831                              Additional 13 XPath functions
PK50692                                       Change PGR DSSIZE for XML table space parts
PK55783                                       XML index exploitation for joins
PK68265                                       XML locking improvement
II14426                                       Info APAR to link together all pureXML-related APARs

3.14 SUMMARY

Although XML has a widely known character-based notation, it is also a hierarchical data model. Every XML document can be represented as a tree of nodes. When you insert or load XML documents into an XML column in a DB2 table, the documents are parsed into a tree format. Such document trees are stored on pages in a table space. As a result, DB2 can query XML data without XML parsing, which is a critical performance benefit.

A table can have both XML and relational columns, but by default XML data is physically stored in a separate storage object. Internal system indexes provide close linkage between the XML documents and the relational rows that they belong to. These storage concepts are transparent to the user applications, which only see logical tables that contain relational and XML columns.

Since XML documents are stored in pages in a table space, most DB2 functions and features apply to XML data just as they do to relational data. For example, buffering in the buffer pool, logging and recovery, backup and restore, reorganization to reclaim free space, optional compression or partitioning of tables, and most database utilities are supported for XML and relational data in an integrated manner. Therefore, the database administrator can apply any existing DB2 knowledge also to the management of XML data.
Chapter 4 Inserting and Retrieving XML Data

In this chapter we discuss “full document” operations, such as insert, delete, and retrieval of whole XML documents in DB2 tables. These are the most basic and most common operations for XML documents. Insert and retrieval operations are of particular interest. They involve the conversion of textual XML data to DB2’s internal XML format upon insert, and the reverse upon retrieval. This conversion from and to the character representation of XML requires handling of XML declarations, whitespace, and reserved characters. DB2 handles these matters automatically for you, but some options exist that allow you to customize its behavior if needed.

In this chapter you learn

• How to insert XML documents from application programs and the DB2 Command Line Processor (section 4.1)
• How to utilize user-defined functions to read documents from the file system (section 4.1.2)
• How to delete, retrieve, and copy XML documents in DB2 tables (sections 4.2 through 4.5)
• How to handle and escape reserved characters in XML documents (section 4.6)
• How to recognize different kinds of whitespace that can appear in XML documents and what it means for XML document storage and retrieval (section 4.7)

We use the following sample table for the examples in this chapter:

    CREATE TABLE shelf(id INTEGER, bookinfo XML)
4.1 INSERTING XML DOCUMENTS

XML documents can be placed into an XML column of a DB2 table with SQL INSERT statements or with the LOAD and IMPORT utilities, which are discussed in Chapter 5, Moving XML Data. XML documents that already exist in a table can be replaced or modified using SQL UPDATE statements, which we cover in Chapter 12, Updating and Transforming XML Documents. In this section we focus on SQL INSERT statements to add XML documents to a table.

An XML column in DB2 can only contain well-formed XML documents. A document is well-formed if it complies with the syntax rules for XML documents that we explained in section 1.1, Anatomy of an XML Document. When you add an XML document to an XML column, DB2 invokes an XML parser which, among other things, verifies whether the document is well-formed. Documents that are not well-formed are rejected because they cannot be reliably processed. You can, however, insert documents that are not well-formed as plain text into CLOB or VARCHAR columns.

When you insert, load, or update XML documents you can optionally validate the documents with an XML Schema (see Chapter 16, Managing XML Schemas, and Chapter 17, Validating XML Documents against XML Schemas).

4.1.1 Simple Insert Statements

You can insert an XML document into a table with SQL INSERT statements either via an API from an application or from the DB2 Command Line Processor (CLP). The CLP is available for DB2 for Linux, UNIX, and Windows and DB2 for z/OS. Other command interfaces such as SPUFI or the Command Editor in the DB2 Control Center can also be used. We first look at INSERT statements issued from the CLP, then at INSERT statements used through APIs.

When you insert an XML document through a command interface such as the CLP, the entire document needs to be hardcoded in the INSERT statement as a string literal or read from the file system with a user-defined function (UDF). Let’s look at the sample document in Figure 4.1.
We have added line breaks and indentation to make it easier to read.

    <?xml version="1.0" encoding="UTF-8" ?>
    <bookstore>
      <book type="database">
        <isbn>0131580183</isbn>
        <title>Understanding DB2</title>
        <author>Raul Chong</author>
      </book>
    </bookstore>

Figure 4.1 Document to be inserted into DB2

The statement that you can issue from the CLP to insert the document in Figure 4.1 into the shelf table is shown in Figure 4.2. Note that the document is provided as a string value in the INSERT statement. This string value has to be enclosed in single quotes.
    INSERT INTO shelf VALUES(4,'<?xml version="1.0" encoding="UTF-8" ?>
    <bookstore>
      <book type="database">
        <isbn>0131580183</isbn>
        <title>Understanding DB2</title>
        <author>Raul Chong</author>
      </book>
    </bookstore>')

Figure 4.2 Inserting an XML document from the DB2 CLP

If an XML document contains a single quote in any of its element or attribute values (see Figure 4.3), then such quotes conflict with the quotes that the INSERT statement requires to mark the beginning and end of the document. The single quote in the word “Don't” in the title element would be interpreted as the closing quote and end of the string value that represents the document. This misinterpretation of the quote would lead to an error.

    <?xml version="1.0" encoding="UTF-8" ?>
    <bookstore>
      <book type="general">
        <isbn>0708823904</isbn>
        <title>Don't cry for me sergeant major</title>
        <author>Robert McGowan</author>
      </book>
    </bookstore>

Figure 4.3 Another document to be inserted into DB2

To avoid an error you should escape the single quote by using either two single quotes (Don''t) or the predefined entity &apos;. Figure 4.4 shows an INSERT statement with correct escaping of the single quote. The single quote does not need to be escaped if you insert the document from an application program and provide the document via a parameter marker or host variable. Escaping special characters is discussed in more detail in section 4.6.

    INSERT INTO shelf VALUES(5,'<?xml version="1.0" encoding="UTF-8" ?>
    <bookstore>
      <book type="general">
        <isbn>0708823904</isbn>
        <title>Don&apos;t cry for me sergeant major</title>
        <author>Robert McGowan</author>
      </book>
    </bookstore>')

Figure 4.4 Escaping single quotes ensures correct XML insertion from the CLP
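The other escaping option, doubling each single quote, is easy to automate when you generate INSERT statements for a command interface. The following Java sketch is a minimal illustration under stated assumptions: the class and method names are hypothetical and not part of any DB2 API, and the helper handles only the quote problem described above.

```java
/** Hypothetical helper for building an SQL string literal for a CLP script. */
final class SqlLiteralQuoting {

    /** Wraps the text in single quotes and doubles every embedded single quote. */
    static String toSqlLiteral(String text) {
        return "'" + text.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book type=\"general\">"
                + "<title>Don't cry for me sergeant major</title>"
                + "</book></bookstore>";
        // Prints a statement in which Don't appears as Don''t
        System.out.println("INSERT INTO shelf VALUES(5, " + toSqlLiteral(xml) + ")");
    }
}
```

This kind of string building is only appropriate for generating test scripts. In application code, passing the document through a parameter marker or host variable avoids the need for quote escaping altogether.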
The default processing mode for INSERT statements strips non-relevant whitespace from the XML document. For example, the line breaks and indentation that you see in Figure 4.4 are removed upon insert. You do not get them back when you retrieve the document, which is acceptable and actually desirable for most applications. Whitespace is typically not meaningful to an application that processes the XML data, and removing whitespace saves storage space.

If you use digital signatures, whether the removal of whitespace affects them depends on the software that signs and verifies the XML documents. Although the XML signature standard (http://www.w3.org/TR/xmldsig-core/) allows for whitespace to be removed, not every signature software might be implemented that way. In case an application requires the preservation of whitespace, DB2 offers several options to do so. They are covered in section 4.7.

Hard-coding an XML document in an SQL INSERT statement is only feasible for simple tests with individual documents that are very small. In most other cases it is better to insert XML documents from a variable in your application code, or from files in the file system. Inserting XML from an application program typically uses parameter markers or host variables, as shown in Figure 4.5 and Figure 4.6. In these examples, the first parameter marker or host variable must be an INTEGER value for the id column of the table shelf, and the second must be an XML document for the XML column bookinfo.

    INSERT INTO shelf(id, bookinfo) VALUES(?,?)

Figure 4.5 Inserting an XML document using parameter markers

    INSERT INTO shelf(id, bookinfo) VALUES(:hostvar1, :hostvar2)

Figure 4.6 Inserting an XML document using host variables

Figure 4.7 shows a code snippet of a Java application that reads an XML document from a file bookfile.xml, and inserts it into the XML column bookinfo in the table shelf.
Additional application code samples for various host languages and APIs are presented in Chapter 21, Developing XML Applications with DB2.
    // Assumes that java.io.File, java.io.FileInputStream, and
    // java.sql.PreparedStatement are imported and that conn is an
    // open java.sql.Connection.
    PreparedStatement insertStmt = null;
    String sqls = null;
    int id = 4;
    File file = new File("bookfile.xml");

    sqls = "INSERT INTO db2admin.shelf(id, bookinfo) VALUES (?, ?)";
    insertStmt = conn.prepareStatement(sqls);
    insertStmt.setInt(1, id);
    // Bind the file contents to the XML column as a binary stream
    insertStmt.setBinaryStream(2, new FileInputStream(file), (int)file.length());
    insertStmt.executeUpdate();

Figure 4.7 Inserting an XML document from a JDBC application

If the XML documents you want to insert are located in files in a file system, you have two additional options:

• Use the DB2 IMPORT or LOAD utilities. See Chapter 5.
• Use a user-defined function (UDF) to read XML documents from files, which is explained in the next section.

4.1.2 Reading XML Documents from Files or URLs

In DB2 for Linux, UNIX, and Windows you can use a set of convenient user-defined functions (UDFs) to read XML documents from files or URLs. These UDFs do not come as part of a regular DB2 installation, but are available from the IBM developerWorks website at http://www.ibm.com/developerworks/exchange/dw_entryView.jspa?externalID=635&categoryID=974.

The download package consists of 11 UDFs and one stored procedure, listed in Table 4.1. They enable you to read XML documents from files, directories, directory trees, URLs, and ZIP files. These functions provide a lot of flexibility in many situations. If you need to populate tables with the greatest possible performance, also consider using the DB2 LOAD utility.

Table 4.1 List of UDFs

Function Name                        Description
blobFromFile                         Reads a file from the DB2 server’s file system and returns the file contents as a BLOB. If this BLOB contains a well-formed XML document, it can be inserted into an XML column.
clobFromFile                         Reads a file from the DB2 server’s file system and returns the file contents as a CLOB. If this CLOB contains a well-formed XML document, it can be inserted into an XML column.
clobFromURL                          Returns a CLOB from a URL.
blobFromURL                          Reads a BLOB from a URL.
blobsFromZipURL, clobsFromZipURL     Table functions that read a ZIP file from a URL and return a table that contains each file from the ZIP in a separate row, as a BLOB or CLOB, respectively.
blobsFromGzipURL, clobsFromGzipURL   Table functions that read a gzipped tar file from a URL and return a table that contains each file from the tar archive in a separate row, as a BLOB or CLOB, respectively.
directoryInfo                        Returns a table with information about files in a directory in the DB2 server’s file system.
directoryInfoRecursive               Returns a table with information about files in a directory and its subdirectories.
urlFromFile                          Returns a URL from a local file name.
insertXmlFromDir                     Stored procedure that inserts all files with extension .xml and .XML from a given directory into a DB2 table.

Many of these UDFs come in two versions that produce either a CLOB or a BLOB. Both types of functions, CLOB and BLOB, can be used to insert XML documents into an XML column. However, inserting XML documents from a BLOB type is preferable to avoid code page conversion issues. If you use a CLOB function to insert XML documents into an XML column, you might introduce unnecessary code page conversion and might damage the data (see Chapter 20, Understanding XML Data Encoding).

If you want to insert a file into a BLOB or VARCHAR FOR BIT DATA column, you must use the BLOB version of the functions. If you want to insert a text file from the file system into a CLOB or VARCHAR column, you must use a CLOB function.
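To see why an unnecessary character conversion can damage a document, consider what happens when UTF-8 bytes are decoded under the wrong code page assumption. The following Java sketch is an analogy only: it does not reproduce DB2's conversion logic, and the class and method names are hypothetical. It decodes the bytes of a small document with an assumed character set, re-encodes the result as UTF-8, and checks whether the original bytes survived.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of damage caused by a wrong code page assumption. */
final class CodePageDemo {

    /**
     * Encodes the text as UTF-8, decodes those bytes with the assumed charset,
     * re-encodes the result as UTF-8, and reports whether the bytes are unchanged.
     */
    static boolean survivesConversion(String text, Charset assumed) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(original, assumed);
        byte[] converted = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, converted);
    }

    public static void main(String[] args) {
        String doc = "<author>Jos\u00e9</author>"; // contains one non-ASCII character
        System.out.println("assumed UTF-8:      "
                + survivesConversion(doc, StandardCharsets.UTF_8));
        System.out.println("assumed ISO-8859-1: "
                + survivesConversion(doc, StandardCharsets.ISO_8859_1));
    }
}
```

Binary data is passed along byte for byte, so no such assumption is made, which is the reason the BLOB functions are the safer choice for XML columns.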
The UDFs come in a package that needs to be installed with the following command:

    db2 -td% -f XMLFromFile.clp

Before you run the script file XMLFromFile.clp, you must edit it so that it contains the path to the directory where the package has been unpacked and where, as a result, the file XMLFromFile.jar is located.
Let’s insert an XML document that is contained in the file C:\xml\book\book01.xml. The INSERT statement in Figure 4.8 illustrates how to use the function blobFromFile to read the file and use it as input for the INSERT statement.

    INSERT INTO shelf(id, bookinfo)
      VALUES(7, blobFromFile('c:\xml\book\book01.xml'))

Figure 4.8 Inserting a file with the blobFromFile function

If the UDF cannot find the specified file, the INSERT statement fails with the following error:

    SQL0443N Routine "*FROMFILE" (specific name "") has returned an error
    SQLSTATE with diagnostic text "java.io.FileNotFoundException:
    c:\xml\book\book01.xml". SQLSTATE=38A00

The UDF directoryInfo is a table function and enables you to list the files in a specified directory. The function returns a table with one row for each file, and with columns for the filename, size, timestamp, and other file attributes. You can write the query in Figure 4.9 to list file information for the directory c:\xml\book.

    SELECT filename, size, modtime
    FROM TABLE(directoryInfo('c:\xml\book'));

Figure 4.9 Query to list files in a directory

The function directoryInfo is particularly useful because it allows you to insert selected files from a directory into a DB2 table. For example, the INSERT statement in Figure 4.10 uses the directoryInfo function to read all files from the directory c:\xml that match the pattern book%.xml and inserts them into an XML column. You can filter on file names or other file attributes, as shown in the WHERE clause.

    INSERT INTO shelf(bookinfo)
      SELECT blobFromFile(filename)
      FROM TABLE(directoryInfo('c:\xml'))
      WHERE isDirectory = 0 AND filename LIKE 'book%.xml'

Figure 4.10 Inserting selected XML files from a directory

If the XML documents that you want to insert are bundled in a ZIP file, then you can use the table function blobsFromZipURL to extract the files and insert them into the target table.
This table function returns the columns FILENAME, SIZE, COMPRESSEDSIZE, MODTIME, COMMENT and DOC. You can use these columns to select specific documents for insertion, as in Figure 4.11.
Note that the ZIP file does not get unzipped in the file system. Instead, files are extracted and inserted straight from the ZIP file.

    INSERT INTO shelf(bookinfo)
      SELECT doc
      FROM TABLE(blobsFromZipURL(urlFromFile('c:\xml\book\allbook.zip')))
      WHERE filename LIKE '%.xml' OR filename LIKE '%.XML'

Figure 4.11 Inserting documents from a ZIP file

4.2 DELETING XML DOCUMENTS

There are two ways of removing XML documents from a table:

• You can delete entire rows from a table. These rows can be selected with predicates on relational columns in a table or with predicates on XML elements and attributes in the XML columns.
• You can delete just the XML document within a row by setting the XML column to NULL. If you want to prevent NULL values in an XML column, declare the column as NOT NULL, as you would for relational columns.

Figure 4.12 shows two SQL DELETE statements that remove rows with XML documents from a table. The first DELETE statement removes the rows and documents from the shelf table where the relational column id has the value 4. The second statement deletes all rows where the element isbn in the XML document in the bookinfo column has the value 0131580183. These statements work in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

    DELETE FROM shelf
    WHERE id = 4 ;

    DELETE FROM shelf
    WHERE XMLEXISTS('$c/bookstore/book[isbn="0131580183"]'
                    PASSING bookinfo AS "c") ;

Figure 4.12 Deleting rows with XML documents

You can remove just an XML document (instead of the whole row) by setting the XML column for a particular row to NULL, as in Figure 4.13. You cannot remove the XML document by assigning the empty string to the XML column (SET bookinfo = ''). The empty string is not a well-formed XML document and is therefore rejected.
    UPDATE shelf SET bookinfo = NULL WHERE id = 3

Figure 4.13 Setting an XML column to NULL

4.3 RETRIEVING XML DOCUMENTS

This section describes how to retrieve full XML documents. In Chapters 6 through 9 on querying XML data you learn how to retrieve or filter on individual elements or attributes in XML documents.

The easiest way to retrieve whole XML documents is to include an XML column name in the SELECT list of an SQL query. The example in Figure 4.14 selects the relational column id and the XML column bookinfo from the table shelf where the id is less than 10. This query returns two columns that have the data types INTEGER and XML, respectively.

    SELECT id, bookinfo FROM shelf WHERE id < 10

Figure 4.14 Selecting whole XML documents

The query in Figure 4.14 performs an implicit serialization of the XML documents to their textual representation. Implicit serialization means that when XML data is sent to the client, the XML data is automatically converted to serialized (text) format and not returned in DB2’s internal hierarchical format. Implicit serialization does not change the data type, so the serialized XML data is still returned as type XML.

When you retrieve data in the DB2 Command Line Processor, the display of LOBs and XML documents is truncated. Only the first 4KB of each XML document are shown. If you want to retrieve larger documents then you need to use the EXPORT command, explained in Chapter 5.

In a SELECT statement you can also use the function XMLSERIALIZE to perform explicit serialization. Explicit serialization means that the XML data is returned in text format and explicitly converted to a non-XML data type. In DB2 for z/OS, XMLSERIALIZE allows you to return XML documents as type BLOB, CLOB, and DBCLOB. DB2 for Linux, UNIX, and Windows allows CHAR and VARCHAR as additional target types. Figure 4.15 shows three queries with explicit serialization.
The first returns the serialized XML document as a BLOB, the second as a VARCHAR, and the third as a CLOB.
    SELECT XMLSERIALIZE(bookinfo AS BLOB(100k)) FROM shelf;

    SELECT XMLSERIALIZE(bookinfo AS VARCHAR(32000)) FROM shelf;

    SELECT XMLSERIALIZE(bookinfo AS CLOB(1M)) FROM shelf;

Figure 4.15 Returning text with different data types

Implicit serialization is usually preferred, as it avoids unnecessary code page conversion where possible. Explicit serialization can be useful if your query must return a non-XML data type to the application. Explicit serialization to BLOB or CLOB can be beneficial for large documents, because it allows your application to use LOB locators for retrieval. Retrieving XML data into applications and potential code page issues are discussed in more detail in Chapter 20 and Chapter 21.

For an application it is important to know the data types of the columns that it receives. You can use the DESCRIBE command to obtain the column and type information for tables or query result sets. DB2 for Linux, UNIX, and Windows allows you to execute the DESCRIBE command in the Command Line Processor or from an application using the ADMIN_CMD administrative procedure. The DESCRIBE command returns information in result set format that can be processed by the application just like any other SQL result set. In DB2 for z/OS, the statements DESCRIBE TABLE, DESCRIBE OUTPUT, and DESCRIBE CURSOR can be embedded in an application program to read column and type information in an SQL descriptor area (SQLDA).

Figure 4.16 demonstrates the DESCRIBE command and its output in the CLP. You can see that the data type of the bookinfo column is XML. The column length is zero because XML documents are stored as trees, which have no notion of length. The maximum document size that can be inserted, loaded, or retrieved is 2GB.
    DESCRIBE TABLE shelf;

    Column name   schema     Data type name   Length    Scale  Nulls
    ------------  ---------  ---------------  --------  -----  -----
    ID            SYSIBM     INTEGER                 4      0  Yes
    BOOKINFO      SYSIBM     XML                     0      0  Yes

Figure 4.16 Describing a table using the DESCRIBE command in the CLP

Figure 4.17 shows how to describe queries that return XML documents. The description of the first query confirms that it returns the documents as type XML. The second query performs explicit serialization, and the DESCRIBE command verifies that the target type is VARCHAR.
    db2 => DESCRIBE SELECT id, bookinfo FROM shelf;

    Number of columns: 2

    SQL type       Type length  Column name      Name length
    -------------  -----------  ---------------  -----------
    497 INTEGER              4  ID                         2
    989 XML                  0  BOOKINFO                   8

    db2 => DESCRIBE SELECT xmlserialize(info as varchar(30000)) AS mydoc
           FROM customer WHERE cid = 1003;

    Number of columns: 1

    SQL type       Type length  Column name      Name length
    -------------  -----------  ---------------  -----------
    449 VARCHAR          30000  MYDOC                      5

Figure 4.17 Describing queries in the CLP

4.4 HANDLING DOCUMENTS WITH XML DECLARATIONS

In Figure 4.1 you saw that our sample document begins with an XML declaration:

    <?xml version="1.0" encoding="UTF-8" ?>

An XML declaration is optional and not required for an XML document to be well-formed. If a document has an XML declaration, the following rules apply:

• The XML declaration must be at the very beginning of the document and cannot be preceded by any characters or whitespace.
• The declaration must contain the version attribute. DB2 only allows XML version 1.0, which is the only version of XML that is widely used.
• Optionally, the declaration can contain an encoding attribute (see Chapter 20).

A document with an XML declaration cannot be inserted into an XML column if any of these rules are violated. For example, the VALUES clause in Figure 4.18 contains blanks after the single quote and before the XML declaration, which leads to error SQL16168N.

    INSERT INTO shelf VALUES(10,'  <?xml version="1.0"?>…

    SQL16168N XML document contained an invalid XML declaration.
    Reason code ="3". SQLSTATE=2200M.

Figure 4.18 Invalid whitespace preceding an XML declaration
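If you want to catch such documents in the application before they reach DB2, you can run them through a standard XML parser first. The following Java sketch uses the JAXP parser that ships with the JDK, not DB2's own parser, so it only approximates the checks that DB2 performs and does not produce DB2 error codes. The class name WellFormedCheck and its method are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical pre-check that parses a document with the JDK's XML parser. */
final class WellFormedCheck {

    /** Returns true if the JDK parser accepts the text as a well-formed document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<?xml version=\"1.0\"?><bookstore/>";
        String bad = "  <?xml version=\"1.0\"?><bookstore/>"; // blanks before the declaration
        System.out.println("declaration first:      " + isWellFormed(good));
        System.out.println("blanks before the decl: " + isWellFormed(bad));
    }
}
```

A check like this also rejects the empty string, which matches the behavior described in section 4.2.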
When a document that contains an XML declaration is stored in DB2, the declaration is not preserved and not part of the stored document. Instead, an XML declaration can optionally be generated and added to each document upon retrieval from the database. The generation of XML declarations is controlled by the application in ways that depend on the API that is used. Chapter 21 provides further details.

Independent from the specific API that an application uses, you can always force the generation of an XML declaration by using the XMLSERIALIZE function with the option INCLUDING XMLDECLARATION. Figure 4.19 serves as an example.

    SELECT XMLSERIALIZE(bookinfo AS CLOB(1M) INCLUDING XMLDECLARATION)
    FROM shelf;

Figure 4.19 Retrieving an XML document with XML declaration

If you invoke the Command Line Processor of DB2 for Linux, UNIX, and Windows with the -d option, an XML declaration is added to each XML value that is retrieved, even if the query does not contain the XMLSERIALIZE function.

4.5 COPYING FULL XML DOCUMENTS

You can copy full XML documents from one table to another using an INSERT/SELECT statement as you normally would for relational data. In the example in Figure 4.20, a target table (shelf2) is created and the XML documents from shelf are copied into it.

    CREATE TABLE shelf2 LIKE shelf;

    INSERT INTO shelf2(id, bookinfo)
      SELECT id, bookinfo FROM shelf;

Figure 4.20 Manipulating full XML documents using INSERT/SELECT

XML data that is moved using the insert/select method remains in DB2’s internal tree format during the entire operation, which means that no XML parsing takes place. Thus, this method of copying XML data is typically more efficient than using the DB2 EXPORT utility followed by the LOAD utility. However, you can perform a LOAD FROM cursor, which requires no EXPORT, avoids logging, and parallelizes the write operations into the target table.
Chapter 5 describes the processing of XML documents with the IMPORT, EXPORT, and LOAD utilities.
4.6 DEALING WITH XML SPECIAL CHARACTERS

Element and attribute values can potentially contain characters that have a special meaning in the world of XML. For example, the less-than symbol (<) denotes the beginning of a tag, the ampersand (&) denotes the beginning of an entity reference, and quotes are used to delimit attribute values. If such characters appear in the middle of an element or attribute value they should be escaped to avoid processing errors. For example, this XML element contains the less-than character in its text value:

    <rule>if a < b then exit(0)</rule>

Any XML parser interprets the less-than character as the beginning of the next XML element tag. The parser then throws an error because the subsequent space is not a valid character in an XML element name. The DB2 error is SQL16110N XML syntax error. Expected to find "Element Name". The error means that the document is not well-formed and cannot be processed.

To solve this problem, the XML standard includes a set of predefined entities that must be used instead of these reserved characters. For example, the <rule> element should use either the entity reference &lt; or the character reference &#60; instead of the actual less-than symbol:

    <rule>if a &lt; b then exit(0)</rule>

Table 4.2 shows all predefined entity and character references that are available to escape reserved XML characters. Note that these references always start with an ampersand (&) and end with a semicolon (;). It does not matter whether you use entity references or character references to escape special characters. Either way is fine, but most people find the entity references more intuitive.

Table 4.2 XML Special Character Substitution Strings

XML Special Character     Entity Reference   ASCII Character Reference
ampersand (&)             &amp;              &#38;
single quote (')          &apos;             &#39;
double quote (")          &quot;             &#34;
greater-than symbol (>)   &gt;               &#62;
less-than symbol (<)      &lt;               &#60;
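The substitutions in Table 4.2 can be applied programmatically when an application assembles XML text from plain values. The following Java sketch is a minimal illustration, assuming that you build the document as a string yourself. The class and method names are hypothetical, and in practice an XML library would usually perform this escaping for you.

```java
/** Hypothetical helper that applies the entity references from Table 4.2. */
final class XmlEscaper {

    /** Replaces the five reserved XML characters with their predefined entities. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The text value is escaped before it is wrapped in element tags
        System.out.println("<rule>" + escape("if a < b then exit(0)") + "</rule>");
    }
}
```

The function must be applied to element and attribute values only, never to a complete document, because it would also escape the angle brackets of the tags themselves.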
Let’s look at one more example, shown in Figure 4.21. This INSERT statement adds information about another book to the table shelf. The title of the book is Helen's story about foxes & rabbits. However, the ampersand (&) is a reserved character and needs to be represented either by &amp; or by &#38;. Additionally you need to escape the single quote in the title if you want to insert the document through the CLP as a string that must be enclosed in single quotes.

    INSERT INTO shelf VALUES (4,
    '<bookstore>
      <book>
        <title>Helen&apos;s story about foxes &amp; rabbits</title>
      </book>
    </bookstore>')

Figure 4.21 Inserting a document with two entity references

When you retrieve this document, what will the title look like? The simple test in Figure 4.22 reveals that the entity reference &apos; has been resolved into the actual single quote character. However, the entity reference &amp; has been preserved. The reason for the difference is that the single quote is not a reserved character in XML. A document that has a single quote in an element value is still well-formed and hence there is no need to retain the entity reference after the document has been inserted. The ampersand, however, is a reserved character and always has to be escaped so that the document remains well-formed, which is crucial if your application uses an XML parser to process the documents retrieved from DB2.

    db2 => SELECT bookinfo FROM shelf WHERE id = 4;

    <bookstore><book><title>Helen's story about foxes &amp; rabbits</title></book></bookstore>

      1 record(s) selected.

Figure 4.22 Retrieving a document that contains entity references

If you extract the title element and cast its text value to the SQL type VARCHAR, then the actual ampersand character appears in the output (see Figure 4.23). The functions XMLCAST and XMLQUERY are explained in Chapter 7, Querying XML Data with SQL/XML.
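The distinction between serialized XML and an extracted text value is not specific to DB2 and can be observed with any XML parser. As a small illustration, the following Python sketch uses the standard library's ElementTree (our assumption for demonstration purposes; DB2 is not involved). Serializing the parsed element keeps only the escape that well-formedness requires, while reading the text value returns the plain characters.

```python
import xml.etree.ElementTree as ET

stored = "<title>Helen&apos;s story about foxes &amp; rabbits</title>"
title = ET.fromstring(stored)

# Serialization to XML text: the single quote is written as-is,
# the ampersand must stay escaped to keep the output well-formed.
serialized = ET.tostring(title, encoding="unicode")
print(serialized)  # <title>Helen's story about foxes &amp; rabbits</title>

# Extraction of the text value: both characters appear unescaped.
print(title.text)  # Helen's story about foxes & rabbits
```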
    db2 => SELECT XMLCAST(XMLQUERY('$BOOKINFO/bookstore/book/title')
                          AS VARCHAR(35)) as title
           FROM shelf;

    TITLE
    -----------------------------------
    Helen's story about foxes & rabbits

      1 record(s) selected.

Figure 4.23 Retrieving the title as SQL type VARCHAR

4.7 UNDERSTANDING XML WHITESPACE AND DOCUMENT STORAGE

Most XML documents contain whitespace, and its purpose is typically to improve readability. According to the XML standard, whitespace is any of the following characters, shown here with their respective Unicode code points:

• space character (0x20)
• CR, carriage return (0x0D)
• LF, line feed (0x0A)
• tab (0x09)

The XML standard mandates that XML parsers must remove or replace any CR characters (0x0D) that appear in an XML document. Any two-character sequence CR LF is replaced by a single LF, and any CR character that is not followed by LF is also converted to a single LF.

Whitespace can occur at various places in an XML document. For example, the simple document in Figure 4.24 contains whitespace in the following locations:

• Between the element name “a” and the attribute “x”
• On both sides of the “=” character that belongs to the attribute “x”
• Within the double quotes that enclose the value of the attribute “x”
• Between the start tag of element “a” and the start tag of element “b”
• Trailing whitespace within the start and end tag of element “b” and within the end tag of element “a”
• Between the start and end tag of element “b”
• Between the end tag of element “b” and the start tag of element “c”
• Inside the text value of element “c”
• Between the end tag of element “c” and the end tag of element “a”
    <a x = " 1"> <b > </b > <c> 2 </c> </a >

Figure 4.24 A sample document with whitespace

The location of the whitespace matters. Depending on where a whitespace character occurs it is considered one of four types of whitespace:

• Insignificant whitespace (trailing spaces in element or attribute names, spaces around the equality [=] symbol of an attribute, and others)
• Significant whitespace (within attribute and element values)
• Boundary whitespace (between one tag and the next, if no other characters occur there)
• Known whitespace (a single whitespace that precedes an attribute name)

Figure 4.25 shows the same XML document as in Figure 4.24 and identifies the four types of whitespace. Note that the whitespace between the start and end tag of element “b” is considered boundary whitespace and not significant whitespace, because there are no other non-whitespace characters in the text value of element “b”. The whitespace in the text value of element “c” is significant, because there is another non-whitespace character (“2”) adjacent to this whitespace.

    <a x = " 1" > <b > </b > <c> 2 </c> </a >

    [Annotations in the figure: the space before the attribute name “x” is known whitespace; the spaces around “=” and the trailing spaces inside tags are insignificant; the space inside the attribute value and the spaces around “2” are significant; the whitespace between tags is boundary whitespace.]

Figure 4.25 Different types of whitespace

XML parsers always remove all insignificant whitespace, which is not specific to DB2 but required by the XML standard. The XML standard provides no option to preserve insignificant whitespace during XML parsing. On the other hand, significant whitespace is always preserved and there is no option to strip significant whitespace. Known whitespace is a single space (U+0020) that separates an attribute name from a preceding element name or attribute. Known whitespace is removed during XML parsing and not stored with the document, but it is reinjected during serialization when you retrieve the XML data in text format. Boundary whitespace can be preserved or removed (stripped).
Figure 4.26 shows two versions of the sample document from Figure 4.25. In the first version, all insignificant and boundary whitespace has been stripped from the document. In the second version, insignificant whitespace has been stripped but boundary whitespace has been preserved. In DB2, the default behavior is to strip boundary whitespace, but you can choose to preserve boundary whitespace, if desired.
    -- Document with boundary whitespace stripped:
    <a x="1"><b/><c> 2 </c></a>

    -- Document with boundary whitespace preserved:
    <a x="1"> <b> </b> <c> 2 </c> </a>

Figure 4.26 Sample document with and without boundary whitespace preserved

NOTE You can preserve boundary whitespace only if you insert or update documents without validation against an XML Schema. Validation always forces boundary whitespace to be stripped.

4.7.1 Preserving XML Whitespace

DB2’s default behavior to strip boundary whitespace is desirable because it saves space on disk and in memory. Additionally, whitespace is typically not meaningful for applications that consume XML data. Hence, this default is likely the right choice for your application. However, if you encounter a case where boundary whitespace has to be preserved, DB2 supports three ways to enable whitespace preservation. Ordered by precedence, they are:

• The special attribute xml:space inside XML documents
• The explicit strip/preserve whitespace option in the XMLPARSE function
• Changing the DB2 default behavior from “strip” to “preserve” with the CURRENT IMPLICIT XMLPARSE OPTION (see section 4.7.2)

The XML standard defines the optional attribute xml:space that controls the stripping or preservation of whitespace. It can have the values preserve or default, where default means that whitespace is stripped. This attribute can be included in any element in an XML document. It affects the entire subtree under this element, unless it is overridden by other xml:space attributes at a deeper level of the document. If the xml:space attribute appears only in the root element of a document then it affects all boundary whitespace in the entire document. Any xml:space attributes override any whitespace settings in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION.
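The subtree rule for xml:space can be sketched in a few lines of code. The following Python fragment is an illustration only and does not show how DB2 implements whitespace handling; the function name strip_boundary_whitespace is hypothetical, and we assume, as stated above, that any value other than preserve means strip. Python's ElementTree keeps boundary whitespace when parsing, so the sketch removes whitespace-only text between tags except inside subtrees that are governed by xml:space="preserve".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
XML_WHITESPACE = " \t\r\n"

def is_blank(text):
    """True if text exists and consists only of XML whitespace characters."""
    return text is not None and not text.strip(XML_WHITESPACE)

def strip_boundary_whitespace(elem, preserve=False):
    """Drop whitespace-only text between tags unless xml:space="preserve" applies."""
    setting = elem.get(XML_SPACE)
    if setting is not None:
        # A deeper xml:space attribute overrides the inherited setting.
        preserve = (setting == "preserve")
    if not preserve and is_blank(elem.text):
        elem.text = None
    for child in elem:
        strip_boundary_whitespace(child, preserve)
        # child.tail follows the child's end tag, so elem's setting governs it.
        if not preserve and is_blank(child.tail):
            child.tail = None

doc = ET.fromstring(
    '<bookstore>\n'
    '  <book xml:space="preserve">\n'
    '    <isbn>1851586666</isbn>\n'
    '  </book>\n'
    '</bookstore>'
)
strip_boundary_whitespace(doc)
print(ET.tostring(doc, encoding="unicode"))
```

After the call, the line breaks and indentation inside the book element are still present, while the whitespace directly inside bookstore is gone.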
The drawback of xml:space attributes is that they often do not occur in XML documents and it can be time consuming to add them to every document before insertion into DB2. Also, when an xml:space attribute is in place, its effect can only be changed by removing or modifying the attribute in each document. Due to this lack of flexibility it is recommended not to use xml:space attributes. Instead, use the explicit whitespace option in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION, which we explain later.
Let’s look at the four INSERT statements in Table 4.3 through Table 4.6. They all insert a document with whitespace such as indentation and line breaks. Each table shows the INSERT statement followed by the document and its whitespace after it has been retrieved from DB2. Run these INSERT statements in the CLP with the -t and the -q option (db2 -t -q). The -t option sets the semicolon as the default statement terminator. The -q option ensures that the CLP, as an application program for DB2, does not remove new line characters or other whitespace when sending statements to the DB2 server.

The INSERT statement in Table 4.3 does not specify any whitespace option, which implies that all boundary whitespace is stripped. Since boundary whitespace includes line breaks, the document after retrieval is a continuous string without line breaks, spilling over multiple lines as needed. Note that significant whitespace in the title element has been preserved; that is, the spaces between the words This, is, a, space, and story.

Table 4.3 Inserting XML without Preserving Whitespace

INSERT statement:

    INSERT INTO shelf VALUES (10,
    '<bookstore>
      <book>
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>')

Document after retrieval from DB2:

    <bookstore><book><isbn>1851586666</isbn><title>This is a space story</title></book></bookstore>

The document that is inserted in Table 4.4 carries an xml:space attribute with the value preserve, which means that all boundary whitespace in this document is preserved. Hence, when you retrieve the document from DB2 all line breaks and indentation match the original document.
Table 4.4 Inserting an XML Document with xml:space Attribute

INSERT statement:

    INSERT INTO shelf VALUES (11,
    '<bookstore xml:space="preserve">
      <book>
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>')

Document after retrieval from DB2:

    <bookstore xml:space="preserve">
      <book>
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>

The INSERT statement in Table 4.5 wraps the XMLPARSE function with the explicit PRESERVE WHITESPACE clause around the document, which also preserves all boundary whitespace.
Table 4.5 Inserting an XML Document with the XMLPARSE Function

INSERT statement:

    INSERT INTO shelf VALUES (12, XMLPARSE(DOCUMENT
    '<bookstore>
      <book>
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>' PRESERVE WHITESPACE))

Document after retrieval from DB2:

    <bookstore>
      <book>
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>

The INSERT statement in Table 4.6 uses the XMLPARSE function with the STRIP WHITESPACE option, and the document also carries the xml:space attribute in the book element. The effect is that all boundary whitespace is stripped, except within the book element and its child elements. The line breaks and indentation within the book element have been preserved according to the xml:space attribute.

Table 4.6 Interaction between the XMLPARSE Function and xml:space Attribute

INSERT statement:

    INSERT INTO shelf VALUES (13, XMLPARSE(DOCUMENT
    '<bookstore>
      <book xml:space="preserve">
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book>
    </bookstore>' STRIP WHITESPACE))

Document after retrieval from DB2:

    <bookstore><book xml:space="preserve">
        <isbn>1851586666</isbn>
        <title>This is a space story</title>
      </book></bookstore>

4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”

If you always need to preserve boundary whitespace you might find it tedious to ensure that all applications always use the XMLPARSE function with the PRESERVE WHITESPACE option. In this case it is easier to change DB2’s default behavior from STRIP WHITESPACE to PRESERVE WHITESPACE and avoid using the XMLPARSE function. In DB2 for Linux, UNIX, and Windows, the default behavior is controlled by a DB2 special register called CURRENT IMPLICIT XMLPARSE OPTION. It enables you to specify the whitespace handling per session (connection).
You can change the default in several ways:

• Use the following statement from an application or the DB2 CLP:

    SET CURRENT IMPLICIT XMLPARSE OPTION = 'PRESERVE WHITESPACE'

• For CLI applications, add the following entry to the db2cli.ini file:

    CurrentImplicitXMLParseOption = 'PRESERVE WHITESPACE'
  You can edit this file manually, or issue the UPDATE CLI CONFIGURATION command:

    UPDATE CLI CONFIGURATION FOR SECTION <dbname>
      USING CurrentImplicitXMLParseOption '"PRESERVE WHITESPACE"'

• In CLI applications you can also use the function SQLSetConnectAttr() to set the connection attribute SQL_ATTR_CURRENT_IMPLICIT_XMLPARSE_OPTION. It can be set before or after establishing a connection.

Remember that the XMLPARSE function can always be used explicitly to override the default.

4.7.3 Storing XML Documents for Compliance

Many applications have the requirement that once they store an XML document they can get “the same” document back. The key question is how the application defines “the same.” In many cases “the same” means that all element and attribute tags, all element and attribute values, all comments, processing instructions and namespaces, and all significant whitespace have to be returned in the same order and representation as in the original document. This notion of “the same” is sometimes also called Document Object Model fidelity. It means that the structure and data content of your XML documents are always preserved and reproducible, including digital signatures. DB2’s pureXML storage provides this fidelity.

Some applications may take their definition of “the same” one step further. They might require that any XML document that they retrieve from a database is 100% byte-for-byte identical to the one that was inserted, including all insignificant whitespace. To ensure that the documents are byte-for-byte identical you must avoid XML parsing, because the output from an XML parser does not always contain all bytes that were in the original document. This behavior is independent of database storage and inherent in how XML parsing is defined by the XML standard. For example, XML parsers are required by the XML standard to remove insignificant whitespace and normalize line endings. Otherwise they are not compliant.
If you require exact byte-for-byte retention of XML documents then an XML column, which stores XML in a parsed format, should not be your only storage choice for the documents. You should store a second copy of each document in a BLOB or VARCHAR FOR BIT DATA column in the same row. The parsed XML storage allows efficient querying while the binary copy is for auditing or compliance purposes. Note that character data types, such as CLOB or VARCHAR, do not guarantee that documents are stored without any byte modifications, because character data can be subject to code page conversion. Code page issues are explained in Chapter 20.
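To see why parsing and byte-for-byte fidelity do not go together, consider the following minimal sketch in Python. It uses the standard library parser as a stand-in (an assumption for illustration; DB2's parser and serializer are not involved) and a made-up sample document with insignificant whitespace and a CR LF line ending. The round trip keeps the content but changes the bytes, which is enough to invalidate a checksum or signature computed over the original text.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical input with insignificant whitespace and a CR LF line ending.
original = b'<a  x = "1" >\r\n<b ></b ></a >'

# Parse and serialize again, as any parsed XML store has to do on retrieval.
round_trip = ET.tostring(ET.fromstring(original))

same_bytes = hashlib.sha256(original).digest() == hashlib.sha256(round_trip).digest()
print(same_bytes)  # False

# The content is still equivalent: same attributes, same child elements.
before, after = ET.fromstring(original), ET.fromstring(round_trip)
same_content = (before.attrib == after.attrib
                and [c.tag for c in before] == [c.tag for c in after])
print(same_content)  # True
```

Keeping the untouched bytes in a binary column next to the XML column, as recommended above, is what makes the original checksum reproducible.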
4.8 SUMMARY

The basic manipulation of XML documents in DB2 is easy. You can use the familiar SQL statements INSERT, SELECT, and DELETE to add, retrieve, and remove XML documents from an XML column in a DB2 table. UPDATE statements can replace or modify XML documents, which is further discussed in Chapter 12. In INSERT, SELECT, and UPDATE statements, applications can use parameter markers and host variables to exchange XML documents with the DB2 server. Code samples in various programming languages are provided in Chapter 21.

If you include an XML column name in the SELECT list of an SQL query, the column type in the result set is XML and the XML documents are implicitly serialized to their textual representation upon retrieval. Alternatively, the XMLSERIALIZE function allows you to perform explicit serialization. Explicit serialization means that the text form of the XML documents is returned in a non-XML data type of your choosing, such as BLOB, CLOB, or VARCHAR. The XMLSERIALIZE function can be used to force the generation of an XML declaration at the beginning of any document that you retrieve from DB2.

The XML standard defines several reserved characters as well as whitespace characters. Reserved characters, such as the less-than sign (<) or the ampersand (&), cannot appear as-is in the values of an XML document and must be properly escaped. XML documents can contain significant, insignificant, and boundary whitespace. Insignificant whitespace can occur within element tags, such as <book >, and must be removed during XML parsing, as defined in the XML standard. Significant whitespace can occur in element and attribute values and is always preserved. Boundary whitespace, such as new line characters after XML elements, can be either preserved or stripped. DB2 strips boundary whitespace by default, but offers several ways to preserve boundary whitespace, such as the PRESERVE WHITESPACE option in the XMLPARSE function.
CHAPTER 5 Moving XML Data

This chapter looks at the different methods for moving large numbers of XML documents into and out of the database. We also discuss moving XML data between databases with federation and replication. These concepts apply to both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

You can perform basic insert and retrieval of XML documents with SQL INSERT and SELECT statements. However, for bulk processing of XML documents it is often more convenient and more efficient to use the LOAD utility, which is available on all platforms, or the IMPORT utility in DB2 for Linux, UNIX, and Windows.

The tradeoff between insert, load, and import is the same as for relational data. Some applications might have to issue an individual INSERT statement for each XML document (row) as soon as it is received or generated to make new data instantly available for queries. In contrast, you might prefer to use the LOAD or IMPORT utilities if you receive a large number of XML documents in bulk, or if you can afford to accumulate new documents for a nightly batch operation.

There is a 2 GB limit per document for moving data into and out of a DB2 database on any platform. This limit applies to large objects as well as XML documents.

In this chapter, we discuss the following topics:

• Exporting, importing, and loading XML documents in DB2 for Linux, UNIX, and Windows (sections 5.1, 5.2, and 5.3, respectively)
• Loading and unloading XML documents in DB2 for z/OS (sections 5.4 and 5.5)
• Document validation with XML Schemas during load or import (section 5.6)
• Splitting large XML documents into smaller documents (section 5.7)
• XML support in replication, federation, HADR, db2look, and db2move (sections 5.8 through 5.11)

5.1 EXPORTING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS

You can use the EXPORT command in DB2 for Linux, UNIX, and Windows to move XML data, or a mix of XML and relational data, from the database to the file system. In fact, the EXPORT utility allows you to export the result of any query to the file system. If you are familiar with exporting LOB data then you will find that exporting XML data is very similar.

In this section various options for exporting XML data are examined. We use a test table called customer2, based on the customer table in the DB2 sample database. This test table is created and populated using the two statements shown in Figure 5.1. It has one XML column, while the original customer table in the DB2 sample database has two XML columns, one of which is initially empty. The test table contains six rows, each with an XML document in the info column.

    CREATE TABLE customer2 (cid INT, info XML);
    INSERT INTO customer2 SELECT cid, info FROM customer;

Figure 5.1 Creating the test table to demonstrate the EXPORT command

Before you export data, you need to identify or create a directory into which the data will be exported. Then you can use the EXPORT command in a number of different ways:

• Export XML documents and combine them into a single file (see section 5.1.1)
• Export XML documents as individual files (section 5.1.2)
• Export XML documents as individual files with non-default file names (section 5.1.3)
• Export XML documents to one or multiple dedicated directories (section 5.1.4)
• Export fragments of XML documents (section 5.1.5)
• Export XML documents with XML Schema information (section 5.1.6)

Let’s look at each of these in turn. Some of the examples are based on a Windows file system, others on a UNIX or Linux file system.
5.1.1 Exporting XML Documents to a Single File

The simplest form of the EXPORT command allows you to read all rows from a table, including any XML columns (see Figure 5.2). The EXPORT command starts with the keywords EXPORT TO plus the specification of the desired output file. In this example the output file is c:\mydata\cust_exp.del in the Windows file system. The file name is followed by the keywords OF DEL to indicate that the type of the output file is delimited format. The remainder of the EXPORT command is a query whose result set is exported. This query can be more complex than the one shown in Figure 5.2. For example, it can also contain a WHERE clause to filter the exported documents or any of the XML query functions discussed in Chapters 6 through 9.

    EXPORT TO c:\mydata\cust_exp.del OF DEL SELECT * FROM customer2;

Figure 5.2 Exporting all XML documents to a single file

The EXPORT command in Figure 5.2 produces two output files:

• cust_exp.del (see Figure 5.3)
• cust_exp.del.001.xml (see Figure 5.4)

The first file, cust_exp.del, is a delimited format flat file that holds the relational data of the exported result set. The second file, cust_exp.del.001.xml, holds the XML data of the XML column in the result set. By default, all XML documents from the XML column are concatenated in this file.

The delimited format file, cust_exp.del, contains information that links the XML documents to the rows that they belong to, as shown in Figure 5.3. More specifically, the delimited format file contains one column for each column in the result set of the exported query. In this example it contains two columns. The first column holds the integer values of the exported column cid. The second column represents the exported column info and contains pointers to the corresponding XML documents in the file cust_exp.del.001.xml. These pointers are XML elements known as XML Data Specifiers (XDS). Each XML Data Specifier has three attributes: FIL, OFF, and LEN. These attributes represent the name of the file that contains the XML data, the byte offset at which a particular document starts, and the length of that document in bytes, respectively.
    1000,"<XDS FIL='cust_exp.del.001.xml' OFF='0' LEN='281' />"
    1001,"<XDS FIL='cust_exp.del.001.xml' OFF='281' LEN='283' />"
    1002,"<XDS FIL='cust_exp.del.001.xml' OFF='564' LEN='282' />"
    1003,"<XDS FIL='cust_exp.del.001.xml' OFF='846' LEN='408' />"
    1004,"<XDS FIL='cust_exp.del.001.xml' OFF='1254' LEN='412' />"
    1005,"<XDS FIL='cust_exp.del.001.xml' OFF='1666' LEN='421' />"

Figure 5.3 Content of the delimited format flat file cust_exp.del

The file cust_exp.del.001.xml contains all the XML documents from the exported XML column concatenated together, as shown in Figure 5.4. As indicated in the DEL file, the second of the six documents begins at byte offset 281 and has a length of 283 bytes.
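The way an XDS pointer identifies a document can be sketched as follows. This Python fragment is an illustration under our own assumptions, not a description of how the IMPORT or LOAD utilities work internally; the function names read_xds_rows and extract are hypothetical, and the concatenated data file is simulated with the first exported document followed by a placeholder element.

```python
import csv
import io
import xml.etree.ElementTree as ET

def read_xds_rows(del_text):
    """Yield (cid, file name, offset, length) for each row of the delimited file."""
    for cid, xds in csv.reader(io.StringIO(del_text)):
        spec = ET.fromstring(xds)  # an XDS pointer is itself a small XML element
        yield int(cid), spec.get("FIL"), int(spec.get("OFF")), int(spec.get("LEN"))

def extract(xml_bytes, offset, length):
    """Cut one document out of the concatenated XML data file."""
    return xml_bytes[offset:offset + length]

# First three rows of Figure 5.3.
del_text = (
    "1000,\"<XDS FIL='cust_exp.del.001.xml' OFF='0' LEN='281' />\"\n"
    "1001,\"<XDS FIL='cust_exp.del.001.xml' OFF='281' LEN='283' />\"\n"
    "1002,\"<XDS FIL='cust_exp.del.001.xml' OFF='564' LEN='282' />\"\n"
)
rows = list(read_xds_rows(del_text))

# Within one file, each document starts where the previous one ends.
for (_, _, off, length), (_, _, next_off, _) in zip(rows, rows[1:]):
    assert off + length == next_off

# The first exported document, followed by a stand-in for the remaining data.
first_doc = (
    b'<?xml version="1.0" encoding="UTF-8" ?><customerinfo Cid="1000">'
    b'<name>Kathy Smith</name><addr country="Canada">'
    b'<street>5 Rosewood</street><city>Toronto</city>'
    b'<prov-state>Ontario</prov-state><pcode-zip>M6W 1E6</pcode-zip></addr>'
    b'<phone type="work">416-555-1358</phone></customerinfo>'
)
data_file = first_doc + b"<placeholder/>"

cid, fil, off, length = rows[0]
print(cid, fil, off, length)
assert extract(data_file, off, length) == first_doc
```

Because offsets and lengths are counted in bytes, a program that reads the data file should open it in binary mode rather than as text.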
You can count the characters in Figure 5.4 to verify these offsets and lengths. Also note that this concatenation of documents does not produce a single well-formed document because a single root element is missing.

    <?xml version="1.0" encoding="UTF-8" ?><customerinfo Cid="1000"><
    name>Kathy Smith</name><addr country="Canada"><street>5 Rosewood<
    /street><city>Toronto</city><prov-state>Ontario</prov-state><pcod
    e-zip>M6W 1E6</pcode-zip></addr><phone type="work">416-555-1358</
    phone></customerinfo><?xml version="1.0" encoding="UTF-8" ?><cust
    omerinfo Cid="1001"><name>Kathy Smith</name><addr country="Canada
    "><street>25 EastCreek</street><city>Markham</city><prov-state>On
    tario</prov-state><pcode-zip>N9C 3T6</pcode-zip></addr><phone typ
    e="work">905-555-7258</phone></customerinfo><?xml version="1.0" e
    ncoding="UTF-8" ?><customerinfo Cid="1002"><name>Jim Noodle</name
    ><addr country="Canada"><street>25 EastCreek</street><city>Markha
    m</city><prov-state>Ontario</prov-state><pcode-zip>N9C 3T6</pcode
    -zip></addr><phone type="work">905-555-7258</phone></customerinfo
    ><?xml version="1.0" encoding="UTF-8" ?><customerinfo Cid="1003">
    <name>Robert Shoemaker</name><addr country="Canada"><street>1596 
    Baseline</street><city>Aurora</city><prov-state>Ontario</prov-sta
    te><pcode-zip>N8X 7F8</pcode-zip></addr><phone type="work">905-55
    5-7258</phone><phone type="home">416-555-2937</phone><phone type=
    "cell">905-555-8743</phone><phone type="cottage">613-555-3278</ph
    one></customerinfo>...

Figure 5.4 Content of the XML data file cust_exp.del.001.xml

5.1.2 Exporting XML Documents as Individual Files

In some situations exporting each XML document into a separate file can be desirable. To do this you need to specify the clause MODIFIED BY with the option xmlinsepfiles. This is shown in Figure 5.5.
    EXPORT TO c:\mydata\cust_exp.del OF DEL
      MODIFIED BY xmlinsepfiles
      SELECT * FROM customer2;

Figure 5.5 Exporting XML documents as separate files

This EXPORT command produces n + 1 files where n is the number of XML documents in the exported XML column. In our example it produces the following seven files in the directory c:\mydata:

• cust_exp.del
• cust_exp.del.001.xml
• cust_exp.del.002.xml
• cust_exp.del.003.xml
• cust_exp.del.004.xml
• cust_exp.del.005.xml
• cust_exp.del.006.xml

The first file is the delimited format flat file that contains the relational data of the exported result set together with pointers to the exported XML documents. These pointers (XML Data Specifiers) look different now because each XML document is exported as a separate file in the file system (see Figure 5.6). Offset and length are no longer required, just the file name of each individual XML document. These file names are derived from the name of the delimited format flat file and extended with an increasing number and the extension .xml. The file numbers start with three digits and additional digits are used as needed when large numbers of documents are exported.

    1000,"<XDS FIL='cust_exp.del.001.xml' />"
    1001,"<XDS FIL='cust_exp.del.002.xml' />"
    1002,"<XDS FIL='cust_exp.del.003.xml' />"
    1003,"<XDS FIL='cust_exp.del.004.xml' />"
    1004,"<XDS FIL='cust_exp.del.005.xml' />"
    1005,"<XDS FIL='cust_exp.del.006.xml' />"

Figure 5.6 Content of the delimited format flat file cust_exp.del

Remember that the examples in this chapter use the table customer2 which has an INTEGER column and an XML column. The table customer, which is readily available in the DB2 sample database, has an INTEGER column and two XML columns, info and history. Since the history column is initially empty (NULL), exporting all columns from the customer table leads to odd-numbered file names: cust_exp.del.001.xml, cust_exp.del.003.xml, cust_exp.del.005.xml, and so on. The even-numbered file names would be used for the documents in the history column, but it is NULL and so these file names are not used.

The xmlinsepfiles option used in Figure 5.5 is just one of many possible options that can be specified in the MODIFIED BY clause of the EXPORT command. Table 5.1 summarizes other options relevant to XML data.
Table 5.1 XML Relevant Modifiers for the EXPORT Command

    Modified by:        Description:

    xmlinsepfiles       This option writes each XML document to a separate file. Without this option, all documents are by default concatenated into a single file.

    xmlnodeclaration    This option produces XML documents without an XML declaration. Without this option the default behavior is that each exported XML document carries an XML declaration with an encoding attribute, such as <?xml version="1.0" encoding="UTF-8" ?>

    xmlchar             This option writes the exported XML documents in the character code page. The character code page is the same as the application code page unless the codepage option of the EXPORT command is specified. Without the xmlchar option, XML documents are by default written out in Unicode. Chapter 20 provides a deeper discussion of code pages and XML document encodings.

    xmlgraphic          This option writes the exported XML documents in the UTF-16 code page regardless of the application code page or the codepage modifier.

5.1.3 Exporting XML Documents as Individual Files with Non-Default Names

If you want the exported XML documents to have file names that are not based on the file name of the delimited format flat file, use the XMLFILE clause of the EXPORT command to specify a different file name prefix. The command in Figure 5.7 exports the table customer2 and writes all XML documents to separate files whose names start with custdoc.

    EXPORT TO c:\mydata\cust_exp.del OF DEL
      XMLFILE custdoc
      MODIFIED BY xmlinsepfiles
      SELECT * FROM customer2;

Figure 5.7 Exporting XML documents to files with custom file names

This command produces the following files:

• cust_exp.del
• custdoc.001.xml
• custdoc.002.xml
• custdoc.003.xml
• custdoc.004.xml
• custdoc.005.xml
• custdoc.006.xml

The XMLFILE clause can also be used without the xmlinsepfiles option. In that case all documents are combined into a single file whose name starts with custdoc.

5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories

The EXPORT command allows you to write the exported XML documents to a dedicated directory that is different from the directory where the delimited format file is written to. To achieve this,
use the XML TO clause to specify an existing directory, as shown in Figure 5.8. This EXPORT command writes the delimited format flat file cust_exp.del to the directory /mydata, and the six XML documents in six separate files to the directory /mydata/customer.

    EXPORT TO /mydata/cust_exp.del OF DEL
      XML TO /mydata/customer
      MODIFIED BY xmlinsepfiles
      SELECT * FROM customer2;

Figure 5.8 Exporting XML documents as individual files to a dedicated directory

If the XML TO clause specifies a list of multiple directories, as in Figure 5.9, the XML documents are distributed evenly among them in a round-robin fashion.

    EXPORT TO /mydata/cust_exp.del OF DEL
      XML TO /mydata/cust1, /mydata/cust2
      XMLFILE custdoc
      MODIFIED BY xmlinsepfiles
      SELECT * FROM customer2;

Figure 5.9 Exporting XML documents as separate files to multiple directories

This EXPORT command produces the following files:

• /mydata/cust1/custdoc.001.xml
• /mydata/cust1/custdoc.003.xml
• /mydata/cust1/custdoc.005.xml
• /mydata/cust2/custdoc.002.xml
• /mydata/cust2/custdoc.004.xml
• /mydata/cust2/custdoc.006.xml

You can later invoke the IMPORT or LOAD utility with the same two paths, /mydata/cust1 and /mydata/cust2, to have DB2 read the same documents in the same round-robin fashion.

If you specify multiple target directories in the XML TO clause but omit the xmlinsepfiles option, as in Figure 5.10, then the EXPORT utility concatenates the exported XML documents into multiple large files, one per target directory.

    EXPORT TO /mydata/cust_exp.del OF DEL
      XML TO /mydata/cust1, /mydata/cust2
      XMLFILE custdoc
      SELECT * FROM customer2;

Figure 5.10 Exporting XML documents to multiple directories
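The bookkeeping behind the round-robin distribution into concatenated files can be sketched as follows. This Python fragment is our own illustration and not DB2's implementation; the function name place_round_robin is hypothetical, and the document lengths are the LEN values from Figure 5.3. Each directory receives one file, and every document is appended to the file whose turn it is, so its offset is the sum of the lengths already written to that file.

```python
def place_round_robin(lengths, directories, prefix="custdoc"):
    """Assign documents in turn to one concatenated file per directory.

    Returns one (path, offset, length) tuple per document.
    """
    next_offset = [0] * len(directories)
    placements = []
    for i, length in enumerate(lengths):
        slot = i % len(directories)  # whose turn it is
        path = f"{directories[slot]}/{prefix}.{slot + 1:03d}.xml"
        placements.append((path, next_offset[slot], length))
        next_offset[slot] += length
    return placements

lengths = [281, 283, 282, 408, 412, 421]  # LEN values from Figure 5.3
placements = place_round_robin(lengths, ["/mydata/cust1", "/mydata/cust2"])
for path, offset, length in placements:
    print(path, offset, length)
```

With two directories, the first, third, and fifth documents land in the first file at offsets 0, 281, and 563, and the remaining documents land in the second file at offsets 0, 283, and 691.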
This EXPORT command produces the following three files:
• The delimited format flat file cust_exp.del in the directory /mydata
• A file called custdoc.001.xml in the directory /mydata/cust1
• A file called custdoc.002.xml in the directory /mydata/cust2

The exported XML documents are evenly distributed across the two files custdoc.001.xml and custdoc.002.xml. The delimited format flat file cust_exp.del contains the rows shown in Figure 5.11. It reveals that the first, third, and fifth documents are stored in the file custdoc.001.xml, while the second, fourth, and sixth documents are stored in custdoc.002.xml. Each document is precisely identified by its offset and length.

1000,"<XDS FIL='custdoc.001.xml' OFF='0' LEN='281' />"
1001,"<XDS FIL='custdoc.002.xml' OFF='0' LEN='283' />"
1002,"<XDS FIL='custdoc.001.xml' OFF='281' LEN='282' />"
1003,"<XDS FIL='custdoc.002.xml' OFF='283' LEN='408' />"
1004,"<XDS FIL='custdoc.001.xml' OFF='563' LEN='412' />"
1005,"<XDS FIL='custdoc.002.xml' OFF='691' LEN='421' />"

Figure 5.11 Content of the delimited format flat file cust_exp.del

5.1.5 Exporting Fragments of XML Documents

Up to now we have looked at exporting whole documents. It is also possible to export document fragments that may or may not be well-formed documents. To achieve this you can use the EXPORT command with any XQuery or SQL/XML query, such as the ones that we discuss in Chapters 6 through 9, which cover XML queries. Let’s consider the following examples.

The command in Figure 5.12 exports all phone elements from each of the six XML documents in the info column of the table customer2. It writes six rows to the output file, one for each XML document. Each row contains one or more phone elements, depending on the number of phone elements in the respective document. If a row contains a sequence of multiple phone elements without a common root element, then this value is not a well-formed XML document.
EXPORT TO /mydata/phones.del OF DEL
  SELECT XMLQUERY('$INFO/customerinfo/phone') FROM customer2;

Figure 5.12 Exporting document fragments

The query in the EXPORT command can also be an XPath or XQuery expression, as shown in Figure 5.13. Similar to the previous example in Figure 5.12, this command also exports all phone
elements from all six customer documents. However, it writes each phone element to a separate row in the output file, even if multiple phone elements come from the same XML document. This is because XQuery and SQL/XML queries that seem to be equivalent can produce result sets with different cardinalities. For details, please refer to Chapter 8 (see section 8.3.3, Result Set Cardinalities in XQuery and SQL/XML).

EXPORT TO /mydata/phones.del OF DEL
  XQUERY db2-fn:xmlcolumn("CUSTOMER2.INFO")/customerinfo/phone;

Figure 5.13 Exporting document fragments as well-formed documents

5.1.6 Exporting XML Data with XML Schema Information

An XML column can contain XML documents that have been validated against one or multiple XML Schemas when they were inserted or loaded. When you export validated XML documents, the EXPORT utility can produce information that tells you for each document which XML Schema it belongs to. This is achieved with the XMLSAVESCHEMA option in the EXPORT command. For each exported XML document that was validated against an XML Schema, the fully qualified SQL identifier of that XML Schema is stored as an attribute (SCH) in the corresponding XML Data Specifier (XDS). The SQL identifier of the XML Schema is the name under which you registered the XML Schema in DB2. If the exported document was not validated against an XML Schema or the schema no longer exists in the database, the SCH attribute is not included in the corresponding XDS. Figure 5.14 shows the command to export documents with XML Schema information.

EXPORT TO /mydata/cust_exp.del OF DEL
  MODIFIED BY xmlinsepfiles
  XMLSAVESCHEMA
  SELECT * FROM customer2;

Figure 5.14 Exporting documents specifying the XML Schema

The delimited format flat file produced might look like the one in Figure 5.15. In this example it shows that the first two documents were validated against the XML Schema with the SQL identifier DB2ADMIN.CUSTXSD.
The third and the fifth documents were validated against schema DB2ADMIN.CUSTXSD2, while the fourth and the sixth documents are not associated with any XML Schema. This information reflects how documents were validated at insert time, if at all. If you load or import the exported XML documents and use this delimited format flat file as input, the documents can be validated against their respective XML Schemas, if those schemas exist in the database.
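Because each XDS is itself a small well-formed XML element, you can inspect the schema information in an exported delimited format file with a script before you import or load the documents. The following Python sketch is an illustration only and not a DB2 tool. The function name schemas_by_file is hypothetical, and the code assumes the default delimiters (comma and double quotation mark) and that the XDS is in the second column of each row.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def schemas_by_file(del_text: str, xml_column: int = 1) -> Dict[str, Optional[str]]:
    """Map each exported file name (FIL) to its schema identifier (SCH).

    Documents without an SCH attribute map to None.
    Hypothetical helper; assumes comma-delimited rows with double-quoted XDS.
    """
    result: Dict[str, Optional[str]] = {}
    for row in csv.reader(io.StringIO(del_text)):
        if not row:
            continue  # skip empty lines
        xds = ET.fromstring(row[xml_column])  # the XDS parses as one XML element
        result[xds.get("FIL")] = xds.get("SCH")
    return result


if __name__ == "__main__":
    sample = (
        "1000,\"<XDS FIL='cust_exp.del.001.xml' SCH='DB2ADMIN.CUSTXSD'/>\"\n"
        "1003,\"<XDS FIL='cust_exp.del.004.xml' />\"\n"
    )
    for file_name, schema in schemas_by_file(sample).items():
        print(file_name, schema)
```

A listing like this makes it easy to confirm that every schema named in the file is registered in the target database before you start the import.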
1000,"<XDS FIL='cust_exp.del.001.xml' SCH='DB2ADMIN.CUSTXSD'/>"
1001,"<XDS FIL='cust_exp.del.002.xml' SCH='DB2ADMIN.CUSTXSD'/>"
1002,"<XDS FIL='cust_exp.del.003.xml' SCH='DB2ADMIN.CUSTXSD2'/>"
1003,"<XDS FIL='cust_exp.del.004.xml' />"
1004,"<XDS FIL='cust_exp.del.005.xml' SCH='DB2ADMIN.CUSTXSD2'/>"
1005,"<XDS FIL='cust_exp.del.006.xml' />"

Figure 5.15 Content of the delimited format flat file cust_exp.del

5.2 IMPORTING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS

In DB2 9.1 for Linux, UNIX, and Windows you can use the IMPORT utility to move XML data into an XML column. Since DB2 Version 9.5 you can also use the LOAD utility to load XML data. The choice between IMPORT and LOAD depends largely on operational considerations, which are similar for XML and relational data:
• The LOAD utility typically performs better than the IMPORT utility because
  • It operates at the DB2 page level, whereas the IMPORT utility operates at the row level.
  • The data loaded by the LOAD utility is not logged in the transaction log.
  • The LOAD utility automatically parallelizes its workload.
• If you use the IMPORT utility, then the target table can be kept fully accessible to other applications for insert and query operations at all times. In particular, you can start an IMPORT operation while other queries on the table are in progress. The LOAD utility has an online mode that allows queries (but no writes) against the target table while the LOAD is in progress. However, queries that started prior to the LOAD must be quiesced before a LOAD or online LOAD can be started.
• If you have triggers on the target table, then these are fired if the IMPORT utility is used, but are not fired if the LOAD utility is used.
• Both the IMPORT and LOAD utilities can optionally perform XML Schema validation and preserve whitespace in the XML documents.

The IMPORT and LOAD utilities can be viewed as inverse operations to EXPORT.
In particular, the IMPORT and LOAD utilities can directly consume the output produced by the EXPORT utility; that is, a delimited format flat file that contains pointers to the XML documents that reside in one or multiple separate files. If you want to IMPORT or LOAD data that wasn’t previously exported with the EXPORT command, you need to produce a delimited format file that looks as if it had been produced by the EXPORT utility.
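If the XML files already exist, a short script can generate such a delimited format file for you. The following Python sketch shows one possible approach and is not part of DB2. The function names xds_rows_for_files and xds_rows_for_concatenation are hypothetical, and the code assumes the default delimiters (comma and double quotation mark) and file names that contain no quotation marks. The first function produces one XDS per file. The second concatenates documents and records the OFF and LEN attributes in the style of Figure 5.11, counting in bytes rather than characters.

```python
from typing import Iterable, List, Tuple


def xds_rows_for_files(first_id: int, file_names: Iterable[str]) -> List[str]:
    """Build one delimited row per XML file: an integer key plus an XDS pointer."""
    return [
        f"{first_id + i},\"<XDS FIL='{name}' />\""
        for i, name in enumerate(file_names)
    ]


def xds_rows_for_concatenation(
    first_id: int, file_name: str, documents: Iterable[bytes]
) -> Tuple[bytes, List[str]]:
    """Concatenate documents and record each one's byte offset and length.

    Returns the concatenated content and the matching delimited rows.
    """
    blob = bytearray()
    rows: List[str] = []
    for i, doc in enumerate(documents):
        # The offset of a document is the number of bytes written before it.
        rows.append(
            f"{first_id + i},\"<XDS FIL='{file_name}' "
            f"OFF='{len(blob)}' LEN='{len(doc)}' />\""
        )
        blob.extend(doc)
    return bytes(blob), rows


if __name__ == "__main__":
    for row in xds_rows_for_files(2000, ["data2.xml", "data3.xml"]):
        print(row)
```

Write the returned rows to the delimited format file, one row per line, and place the XML files where the IMPORT or LOAD command can find them.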
5.2.1 IMPORT Command and Input Files

Assume you want to use the IMPORT command to add new rows to the table customer2, and that you have a directory c:\mydata in the file system that contains several files with XML documents that you want to import. This directory could contain thousands of files, but in this example let’s assume that you just have two XML files called data2.xml and data3.xml, each containing a single XML document. You can produce a delimited format flat file, such as the file data.del in Figure 5.16, which contains two columns. The first column holds INTEGER values for the first column of the target table, and the second column holds pointers to the XML documents that you want to import into the second column of the target table.

2000,"<XDS FIL='data2.xml' />"
2001,"<XDS FIL='data3.xml' />"

Figure 5.16 Content of the delimited format input file data.del

With this delimited format input file you can execute the IMPORT command shown in Figure 5.17. It assumes that the file data.del as well as the XML documents data2.xml and data3.xml are all located in the current directory. The keywords OF DEL indicate that the input file data.del is of type delimited format.

IMPORT FROM data.del OF DEL INSERT INTO customer2;

Figure 5.17 Importing XML documents

If the required files are not located in the current directory, then you must provide appropriate paths. For example, if the file data.del is located in the directory c:\mydata, and the XML documents are in the directory c:\mydata\myxml, then the IMPORT command in Figure 5.18 obtains the files from the appropriate locations.

IMPORT FROM c:\mydata\data.del OF DEL
  XML FROM c:\mydata\myxml
  INSERT INTO customer2;

Figure 5.18 Importing XML documents from specific locations

Incorrect file paths in the IMPORT command are a very common mistake, so pay extra attention to them!
NOTE If you need to load XML data that was previously exported to multiple directories, specify the list of directories in the XML FROM clause. This clause corresponds to the XML TO clause of the EXPORT command.
If the two XML documents data2.xml and data3.xml happen to be concatenated as a single file (for example, docs.xml), then the delimited format input file needs to specify offset and length for each document, as in Figure 5.19. The first XML document starts at an offset of 0 bytes into the file and is 281 bytes long. The second XML document starts at offset 281 and is 283 bytes long, and so on for all XML documents that may be in the same file. Since it is tedious to determine the number of bytes of each document, such an input file with offsets and lengths is typically only used if it is available from a previous EXPORT operation or generated by an application.

2000,"<XDS FIL='docs.xml' OFF='0' LEN='281' />"
2001,"<XDS FIL='docs.xml' OFF='281' LEN='283' />"

Figure 5.19 Input file for multiple concatenated documents

As an aside, what happens if you have more than one XML column in the target table? To populate a table with two XML columns, the delimited format input file has to contain two XML Data Specifiers (XDS) per row, one for each XML column that you want to populate. Such an input file is shown in Figure 5.20.

2000,"<XDS FIL='data2.xml' />","<XDS FIL='data2b.xml' />"
2001,"<XDS FIL='data3.xml' />","<XDS FIL='data3b.xml' />"

Figure 5.20 Input file to populate an integer column and two XML columns

When you import, insert, or load XML data, insignificant whitespace is by default automatically stripped from the XML documents (see section 4.7, Understanding XML Whitespace and Document Storage). If you want to preserve whitespace, specify the XMLPARSE PRESERVE WHITESPACE clause in the IMPORT command (see Figure 5.21).

IMPORT FROM c:\mydata\cust_exp.del OF DEL
  XML FROM c:\mydatadata
  XMLPARSE PRESERVE WHITESPACE
  INSERT INTO customer2;

Figure 5.21 Importing XML data into a table and preserving whitespace

5.2.2 Import/Insert Performance Tips

Several performance guidelines are common to all methods of populating a table with XML data.
If you have multiple user-defined XML indexes on a table, it is typically better to define them before populating the table than to create them afterwards. During INSERT, LOAD, or IMPORT, each XML document is processed only once to generate index entries for all XML indexes. In contrast, if multiple CREATE INDEX statements are issued after the table has been populated, all documents in the XML column are traversed multiple times, once for each index.
Even if you have not defined any indexes on the target table, DB2’s pureXML storage mechanism transparently maintains regions and path indexes for efficient XML storage access (see Chapter 3, Designing and Managing XML Storage Objects). Take these indexes into account when determining buffer pool sizes.

Just as for relational data, you can issue the ALTER TABLE <tablename> APPEND ON command, which enables append mode for the table. New data is appended to the end of the table instead of searching for free space on existing pages. This can improve the runtime performance of bulk inserts or imports.

You can avoid logging if you use the ALTER TABLE <tablename> ACTIVATE NOT LOGGED INITIALLY command. However, be warned that if there is a statement failure, the table will be marked as inaccessible and must be dropped. This risk often prohibits using the NOT LOGGED INITIALLY (NLI) option for incremental bulk inserts in production systems. The option can be useful for the initial population of an empty table. Beware that NLI prevents concurrent inserts/imports into a target table and that parallelism can yield higher performance than NLI.

If you use the IMPORT command, a small value for the COMMITCOUNT parameter tends to hurt performance. Committing every 100 rows or more performs better than committing every row. An IMPORT command with an explicit COMMITCOUNT parameter is shown in Figure 5.22.

IMPORT FROM c:\mydata\data.del OF DEL
  XML FROM c:\mydata
  COMMITCOUNT 100
  INSERT INTO customer2;

Figure 5.22 IMPORT command with COMMITCOUNT parameter

To achieve higher performance than provided by the IMPORT utility, consider using the LOAD utility instead, which automatically parallelizes its work.

5.3 LOADING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS

Since DB2 9.5 for Linux, UNIX, and Windows you can use the LOAD utility to move XML documents into a table.
The key advantages of the LOAD utility are the same for XML as for relational data. For example, the data is not logged and parallelism is automatically used to increase performance. DB2 determines a default degree of parallelism based on the number of CPUs and table space containers. The syntax for handling XML data in the LOAD command is the same as the XML-specific syntax in the IMPORT command. For example, the only difference between the LOAD command in Figure 5.23 and the IMPORT command in Figure 5.18 is that the keyword IMPORT has been replaced by the keyword LOAD.
LOAD FROM c:\mydata\data.del OF DEL
  XML FROM c:\mydata\myxml
  INSERT INTO customer2;

Figure 5.23 Example of a LOAD command

The LOAD command has several optional parameters that can affect performance. DB2 automatically determines suitable values for these parameters, so you can usually obtain good load performance out-of-the-box without setting any parameters. If you want to try to improve load performance, consider the following parameters:
• DATA BUFFER <buffer-size>: This parameter specifies the number of 4KB pages (regardless of the degree of parallelism) to use as buffered space for transferring data within the utility. The data buffers use the utility heap, whose size can be modified through the util_heap_sz database configuration parameter. Large degrees of parallelism require a larger util_heap_sz.
• CPU_PARALLELISM <n>: This parameter specifies the number of threads that the LOAD utility uses for parsing, converting, and formatting records.
• DISK_PARALLELISM <n>: This parameter specifies the number of threads that the LOAD utility uses for writing data to the table space.

After a LOAD operation, the loaded table might be in SET INTEGRITY PENDING state in either READ or NO ACCESS mode. This means that the table is only available for read or not available at all. You can check whether the loaded table is in SET INTEGRITY PENDING status (also known as CHECK PENDING status) by looking at the STATUS column of the catalog view SYSCAT.TABLES and checking for a STATUS value equal to "C" (see Figure 5.24). The value "C" means CHECK PENDING.
SELECT SUBSTR(tabschema,1,10) AS tabschema,
       SUBSTR(tabname,1,10) AS tabname,
       status
FROM syscat.tables
WHERE status = 'C';

TABSCHEMA  TABNAME    STATUS
---------- ---------- ------
DB2ADMIN   CUSTOMER   C

Figure 5.24 Listing tables that are in CHECK PENDING state

One of the most common reasons why a table is placed in CHECK PENDING state after a LOAD operation is that the table has check constraints or referential integrity constraints defined on it. To take a table out of CHECK PENDING state, issue the SET INTEGRITY command:
SET INTEGRITY FOR db2admin.customer2 IMMEDIATE CHECKED

DB2 performs minimal logging for the LOAD utility, because the operations are performed at the DB2 page level and not the DB2 row level. If you have DB2 archive logging enabled (disabled by default) and use the LOAD command, then the table space that contains the table is placed in BACKUP PENDING state after the load. After the load operation you have to take a backup of the table space containing the table before you issue the SET INTEGRITY command.

An alternative to taking the backup is to specify the COPY YES option in the LOAD command. This option instructs DB2 to perform a backup of the new data while it is being loaded, which avoids the BACKUP PENDING state. Another alternative is to specify the NONRECOVERABLE option in the LOAD command. This option means the table space is not put in BACKUP PENDING state following the LOAD operation and a copy of the loaded data does not have to be made during the load. However, it is not possible to recover the table by a subsequent roll forward action.

You can also move XML data from one table to another using the “load from cursor” option of the LOAD utility. This option allows you to move data between tables without having to unload the data first. In Figure 5.25 a cursor curs is declared. The subsequent LOAD command uses this cursor to move data from the table customer2 into table customer3. Loading XML data from a cursor is supported for tables in the same database but not for moving XML data from one database to another (error SQL1407N).

DECLARE curs CURSOR FOR SELECT cid, info FROM customer ;
LOAD FROM curs OF CURSOR INSERT INTO customer3(cid,info) ;

Figure 5.25 Example of loading data from a cursor

5.4 UNLOADING XML DATA IN DB2 FOR Z/OS

You have two options for unloading data from DB2 for z/OS. You can use either the DSNTIAUL utility or the UNLOAD utility.
An example of using the DSNTIAUL utility to unload data from a table called customer is shown in Figure 5.26. The execution of the DSNTIAUL utility in Figure 5.26 produces two output files, pointed to by SYSREC00 and SYSPUNCH. The SYSPUNCH sequential dataset contains the LOAD statement for you to be able to load the unloaded data into a new table. The SYSREC00 sequential dataset contains the unloaded data, including the XML data.
//DSNTIAUL EXEC PGM=IKJEFT01
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC00 DD DSN=USER123.DSN8UNLD.SYSREC00,VOL=SER=P8P007,
// UNIT=SYSDA,SPACE=(32760,(1000,500)),DISP=(,CATLG)
//SYSPUNCH DD DSN=USER123.DSN8UNLD.SYSPUNCH,
// UNIT=SYSDA,SPACE=(800,(15,15)),DISP=(,CATLG),
// RECFM=FB,LRECL=120,BLKSIZE=1200,VOL=SER=P8P007
//SYSTSIN DD *
DSN SYSTEM(ISC9)
RUN PROGRAM(DSNTIAUL) PLAN(DSNTIB91) PARMS('SQL') LIB('ISC910P8.RUNLIB.LOAD')
END
//SYSIN DD *
SELECT * FROM CUSTOMER;

Figure 5.26 Unloading data using the DSNTIAUL utility

You can also use the UNLOAD utility to unload XML data. Remember that in DB2 for z/OS, the XML data of an XML column always resides in an XML table space, separate from the base table space. In the UNLOAD statement you just need to specify the base table space. You do not have to specify the XML table space. An example is shown in Figure 5.27, where the data is unloaded in delimited format. Once you have determined the table space and database for the table you want to unload, you can plug these values into the unload job as shown in Figure 5.27.

//UNLOAD EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC DD DSN=USER123.UNLOAD.SYSREC,
// DISP=(MOD,CATLG,CATLG),
// UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSPUNCH DD DSN=USER123.UNLOAD.SYSPUNCH,
// DISP=(MOD,CATLG,CATLG),
// UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN DD *
UNLOAD TABLESPACE DSN00191.CUSTOMER
DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
FROM TABLE CUSTOMER
(CID POSITION(*) INT,
INFO POSITION(*) XML)
UNICODE
/*

Figure 5.27 Unloading data using the UNLOAD utility
For maximum portability, you should specify UNICODE in the UNLOAD statement and use Unicode delimiter characters. If XML columns are not being unloaded in UTF-8 CCSID 1208, the unloaded column values are prefixed with a standard XML encoding declaration that specifies the encoding that is used.

If the table that you unload contains XML documents larger than 32KB, you need to use file reference variables (FRV) to unload the XML data to a separate partitioned data set (PDS) or hierarchical file system (HFS) file. Figure 5.28 shows an unload to a PDS.

//SYSIN DD *
TEMPLATE XMLHERE DSN 'USER123.&DB..&TS..UNLOAD'
DSNTYPE(PDS) UNIT(SYSDA)
UNLOAD DATA
DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
FROM TABLE CUSTOMER
(CID INT,
INFO VARCHAR(255) CLOBF XMLHERE)
UNICODE
/*

Figure 5.28 SYSIN cards for unloading XML documents larger than 32KB

Let’s look at how the SYSIN cards in Figure 5.28 are constructed. The first two lines define a template with the name XMLHERE. The template declares the output naming pattern for the XML data files. The variables &DB and &TS take the value of the database and table space where the XML data is unloaded from. The parameter DSNTYPE specifies the type of volume for the unloaded data. If PDS is specified, then this limits the output dataset to a single volume. This is also the default if no DSNTYPE is specified. If the output should use multiple volumes, then you must specify HFS.

Next is the UNLOAD DATA statement. The line starting with DELIMITED defines how the data is to be delimited. The last line specifies that the XML documents that are unloaded from the XML column INFO are represented in the output data by file names of up to 255 characters. The type VARCHAR(255) defines the data type of the XML file names, not of the actual XML data. The keyword CLOBF tells UNLOAD to use File Reference Variables (FRV) and to store the XML documents as CLOB files.
You can also specify BLOBF or DBCLOBF as possible output file formats. The template name XMLHERE tells UNLOAD to name the XML files according to the template that was defined in the first line. If you do not specify EBCDIC, ASCII, UNICODE, or CCSID, the encoding scheme of the source data is preserved.

If the output PDS that will contain the XML documents does not exist, the job will create it for you. The names of the output files are stored in the SYSREC data set as strings, as shown in Figure 5.29.

1000.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQCY)
1001.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQDR)
1002.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQEB)
...

Figure 5.29 Contents of SYSREC DS when unloading documents larger than 32KB
You can see that the value of the relational column cid is the first part of each record. Each of the output files pointed to by the remainder of the record contains an XML document. Note the random member name. If the dataset already contains members when the job is run, then the existing members are not deleted, but new members (again with random names) are added. The dataset that SYSREC points to, however, is overwritten with the new names.

The dataset pointed to by SYSPUNCH contains the statements that you need to put into a LOAD job, as shown in Figure 5.30. Such a LOAD job is discussed in section 5.5.

LOAD DATA INDDN SYSREC LOG NO RESUME YES
UNICODE CCSID(01208,01208,01208)
FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
SORTKEYS 3
INTO TABLE "USER123"."CUSTOMER"
("CID" POSITION(*) INTEGER,
"INFO" POSITION(*) VARCHAR CLOBF MIXED PRESERVE WHITESPACE)

Figure 5.30 Output SYSPUNCH DS when unloading records larger than 32KB

5.5 LOADING XML DATA IN DB2 FOR Z/OS

To load data into tables you use the LOAD utility, as shown in Figure 5.31. The data that was unloaded in Figure 5.27 is being loaded into a new table called customer2. This table has an INTEGER column and an XML column. Remember that only well-formed XML documents can be loaded into an XML column.
//LOAD01 EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK01 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK02 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK03 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK04 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//MYSYSREC DD DSN=USER123.UNLOAD.SYSREC,DISP=SHR
//SYSUT1 DD UNIT=SYSDA,SPACE=(4000,(50,50),,,ROUND)
//SYSERR DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSDISC DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSMAP DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN DD *
LOAD DATA INDDN (MYSYSREC) LOG NO RESUME YES
UNICODE CCSID(01208,01208,01208)
FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
SORTKEYS 3
INTO TABLE "USER123"."CUSTOMER2"
( "CID" POSITION(*) INTEGER
, "INFO" POSITION(*) XML PRESERVE WHITESPACE
)
/*

Figure 5.31 Example of a DB2 for z/OS LOAD job
Note:
• If you have unloaded the data previously, using the jobs shown in Figure 5.26 or Figure 5.27, then the SYSIN records are the contents of the dataset that the SYSPUNCH DD card points to in these jobs.
• The PRESERVE WHITESPACE option has been specified for the XML column. It can be omitted, in which case the default behavior is not to preserve whitespace.
• If you omit the UNICODE CCSID line, then you get the following error: “RECORD (1) WILL BE DISCARDED DUE TO 'CID' CONVERSION ERROR”. The Unicode input data for FORMAT DELIMITED must be UTF-8, which is CCSID 1208.
• The COLDEL parameter specifies the column delimiter that is used in the input file. The default value is a comma (,). For ASCII and UTF-8 data this is X'2C', and for EBCDIC data it is X'6B'. The CHARDEL parameter specifies the character string delimiter that is used in the input file. The default value is a double quotation mark ("). For ASCII and UTF-8 data this is X'22', and for EBCDIC data it is X'7F'. The DECPT parameter specifies the decimal point character that is used in the input file. The default value is a period (.). For ASCII and UTF-8 data this is X'2E', and for EBCDIC data it is X'4B'.

When the XML data is loaded as a part of regular input records, specify XML as the input field type. The target column must be an XML column. The LOAD utility treats XML columns as variable-length data when loading XML directly from input records and expects a two-byte length field preceding the actual XML value.

The internal XML tables are loaded when the base table is loaded. You cannot specify the name of the internal XML table for load. You also cannot directly load the DocID column of the base table space or specify a default value for an XML column.

You can load XML documents from regular input records if the total input record length is less than 32KB. XML documents that don’t fit into 32KB input records must be loaded from separate files.
To achieve this you need to replace the SYSIN cards in Figure 5.31 with the ones in Figure 5.30. The SYSREC input dataset is the dataset you specified in the UNLOAD job in Figure 5.27.

If you have documents larger than 32KB that come from a source other than a previous unload, you can load these into a table as follows. As an example let us use a document called DOC01, which is also the member name in a partitioned dataset called USER123.XMLLOAD. First you need to edit the dataset pointed to by SYSREC and add the relational value for the cid column of the row, as shown next:

2000.USER123.XMLLOAD(DOC01)

You can now use exactly the same SYSIN cards as before to load this document into the table customer2.
Note that DB2 for z/OS does not compress an XML table space during the LOAD process. If the XML table space is defined with COMPRESS YES, then you have to run a REORG to compress the data.

5.6 VALIDATING XML DOCUMENTS DURING LOAD AND INSERT OPERATIONS

When you use the LOAD or IMPORT utilities in DB2 for Linux, UNIX, and Windows to move a large number of XML documents into a table, you can validate these documents against an XML Schema. Simply add the clause XMLVALIDATE USING SCHEMA to the LOAD or IMPORT command, as illustrated in Figure 5.32.

LOAD FROM c:\mydata\load_customer.txt OF DEL
  XML FROM c:\mydatadata
  XMLVALIDATE USING SCHEMA db2admin.custxsd
  INSERT INTO customer;

Figure 5.32 Performing XML Schema validation during LOAD

In DB2 for z/OS there is no XMLVALIDATE option for the LOAD utility but you can validate documents after loading them into a table. This and other validation topics are covered in Chapter 17, Validating XML Documents against XML Schemas.

5.7 SPLITTING LARGE XML DOCUMENTS INTO SMALLER DOCUMENTS

Most programmers find it convenient and efficient to work with an XML document granularity that matches the logical business objects of the application and the predominant granularity of access. For example, a single document per purchase order, per trade, per contract, per tax return, per customer, and so on is usually a good idea. Smaller documents can be manipulated more efficiently than larger ones. Also, indexed access and data retrieval is faster for smaller documents.

However, for a bulk transfer of XML data outside the database, such as FTP, it is often not convenient to handle thousands or millions of separate documents. Therefore, it is common to receive large XML documents, often several hundred megabytes per file, which contain many repeating blocks that represent independent objects.
Many external XML tools fail, or have severe problems, when you try to open such large XML documents, typically due to document object model (DOM) parsing and memory limitations. DB2 can ingest XML documents up to 2GB. Optionally, you can split them into smaller documents using the XMLTABLE function. The XMLTABLE function is discussed in detail in Chapter 7, Querying XML Data with SQL/XML. Here we show one simple example of how it can split up documents.
Assume you need to manage many XML documents with the following (simplified) structure:

<account>
  <id>1</id>
  <name>Heather</name>
  <amount>12.34</amount>
</account>

You may receive many of these documents in one large file that has a root element <accounts>. The root element is required for the file to be a well-formed document. Otherwise it cannot be processed in DB2. The large file looks like this:

<accounts>
  <account>
    <id>1</id>
    <name>Heather</name>
    <amount>12.34</amount>
  </account>
  <account>
    <id>2</id>
    <name>Helen</name>
    <amount>56.78</amount>
  </account>
  …
</accounts>

Your first step is to insert, import, or load this document into a staging table that has a column of type XML, such as this one:

CREATE TABLE staging(xcol XML)

When this table contains the large document in a single row, you can read the document from the staging table, split it into the individual account documents, and insert those into the following target table:

CREATE TABLE accounts(acc XML)

To split the large document, use one of the two INSERT statements in Figure 5.33. Both accomplish the same thing; that is, they produce one row (document) in the target table for each account element in the large input document. You must create an XML document node for each newly created account document, either with the SQL/XML function XMLDOCUMENT, or with the XQuery function document{}. The latter is only available in DB2 for Linux, UNIX, and Windows. The first of the two statements in Figure 5.33 is suitable for DB2 for z/OS.
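To make the intended result of the split concrete, the following Python sketch performs the same kind of split outside the database with the standard xml.etree.ElementTree module. It only illustrates the expected outcome and is not how DB2 processes the INSERT statements. The function name split_accounts is made up for this example, and because the code parses the whole input into memory, it is not suited to the very large files discussed earlier.

```python
import xml.etree.ElementTree as ET
from typing import List


def split_accounts(large_document: str) -> List[str]:
    """Return one serialized <account> document per child of <accounts>.

    Hypothetical helper for illustration; the whole input is parsed in memory.
    """
    root = ET.fromstring(large_document)
    return [
        # tostring() includes whitespace that follows the element, so strip it.
        ET.tostring(account, encoding="unicode").strip()
        for account in root.findall("account")
    ]


if __name__ == "__main__":
    large = (
        "<accounts>"
        "<account><id>1</id><name>Heather</name><amount>12.34</amount></account>"
        "<account><id>2</id><name>Helen</name><amount>56.78</amount></account>"
        "</accounts>"
    )
    for document in split_accounts(large):
        print(document)
```

Each string in the returned list corresponds to one row that the INSERT statements produce in the target table.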
INSERT INTO accounts(acc)
  SELECT XMLDOCUMENT(x.val)
  FROM staging,
       XMLTABLE('$x/accounts/account' passing xcol as "x"
                COLUMNS val XML PATH '.') AS x;

INSERT INTO accounts(acc)
  SELECT x.val
  FROM staging,
       XMLTABLE('$XCOL/accounts/account'
                COLUMNS val XML PATH 'document{.}') AS x;

Figure 5.33 Splitting a large document

After the insert operation, select the data from accounts to verify that the large input document has been split correctly (see Figure 5.34).

SELECT acc FROM accounts;

<account>
  <id>1</id>
  <name>Heather</name>
  <amount>12.34</amount>
</account>
<account>
  <id>2</id>
  <name>Helen</name>
  <amount>56.78</amount>
</account>

2 record(s) selected.

Figure 5.34 Selecting the split documents from the target table

Instead of reading the large input file from a staging table, you can also pass it into the INSERT statement in Figure 5.33 via a parameter marker. See Chapter 11, Converting XML to Relational Data, for related examples. The input file can also be read from the file system with one of the UDFs explained in section 4.1.2, Reading XML Documents from Files or URLs.

5.8 REPLICATING AND PUBLISHING XML DATA

In this section we briefly discuss how XML data can be replicated and published using WebSphere Replication Server and InfoSphere Data Event Publisher V9.5. This is applicable to DB2 for Linux, UNIX, and Windows and DB2 for z/OS. At the time of writing there is no support for
the XML data type in SQL replication. If you want to replicate XML data you must use Q replication. The XML data type is supported as a replication source for WebSphere Replication Server (Q replication) from DB2 9.5 onwards. Note the following:

• Q replication uses WebSphere MQ as the message transport mechanism, and as such there is a limit of around 100MB to the size of the XML document you can replicate.
• You cannot filter replication based on the contents of the XML documents.
• There is no automatic validation of XML documents by Q Apply at the target. If you want to perform XML validation at the target, you can define a trigger to achieve that (see section 17.5, Automatic Validation with Triggers).
• There is no replication of XML Schemas or schema registrations.

You can use WebSphere Replication Server (Q replication) to replicate tables containing XML data type columns in unidirectional, bidirectional, and peer-to-peer mode. It is outside the scope of this book to describe the details of setting up Q replication. See the DB2 Information Center for further details.

WebSphere Replication Server 9.7 has added XML capabilities in Q Apply for DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Generally, Q Apply enables you to define custom SQL expressions to manipulate the data as it is integrated into the target database. These custom expressions can now include a selected set of XML functions such as XMLQUERY, XMLPARSE, XMLSERIALIZE, XMLCAST, XMLVALIDATE, and XMLDOCUMENT. These functions enable a wide range of useful XML document manipulations, including the following:

• Use XMLPARSE to replicate XML documents from VARCHAR columns to XML columns.
• Use XMLSERIALIZE to replicate XML documents from XML columns to CLOB or BLOB columns.
• Use XMLQUERY with an XPath expression to extract XML fragments (subtrees) from the source documents. Add the XMLDOCUMENT function to create document nodes for the XML fragments.
• Use XMLQUERY to extract individual XML element or attribute values from XML source documents. Add XMLCAST to convert these XML values to SQL data types. (The XMLTABLE function is not supported in Q Apply.)
• Use XMLQUERY to apply XQuery update expressions to the replicated document; for example, to delete, add, or rename individual elements or attributes in the document.
• Use XMLVALIDATE to validate the documents with an XML Schema at the target database.
5.9 FEDERATING XML DATA

Database federation means that one DB2 database acts as a federated server that has access to remote data sources. Remote data sources can include other DB2 databases, non-DB2 databases, flat files, Excel files, message queues, and other sources. The purpose is that these remote data sources appear in the federated DB2 server as if they are local DB2 tables. The federated server provides applications with the illusion that all visible data resides in local relational tables. Federation hides the fact that some of this data is actually located at remote sources, which may or may not be relational databases. The official product name for this functionality is InfoSphere Federation Server (formerly WebSphere Federation Server) although it is actually a DB2 feature.

Consider the scenario where one DB2 database acts as the federated server, and another DB2 database acts as a remote data source. Tables in the remote database can be registered as nicknames in the federated server. Subsequently they appear as if they were local tables. These tables can contain columns of type XML. Federation allows an application to connect to just one database, the federated server, and have access to XML data in other DB2 databases. You can federate XML data stored in XML columns in DB2 for Linux, UNIX, and Windows using the DRDA wrapper.

For a simple test with federation, two databases are required. Let’s assume we have two databases, samplxml and samplsql, and that samplsql is the federated server while samplxml is the remote data source. If the databases are not in the same instance, which is often the case, you need to catalog the remote database in the federated server instance. Figure 5.35 shows the steps to configure the federated system. The goal is to make the customer table in the remote database samplxml locally visible in the database samplsql.
-- enable federation for the instance which contains
-- the federated server:
UPDATE DBM CFG USING federated yes;
db2stop;
db2start;

-- connect to the federated server database:
CONNECT TO samplsql;

-- create a DRDA wrapper:
CREATE WRAPPER DRDA LIBRARY 'db2drda.dll';

-- register the other database (samplxml) as a data source
-- and assign the local name "remoteXML":
CREATE SERVER remoteXML TYPE DB2/UDB VERSION '9.5' WRAPPER DRDA
  AUTHID "db2admin" PASSWORD "*****"
  OPTIONS( ADD DBNAME 'samplxml');

Figure 5.35 Configuring the federated server for access to a remote data source
-- register authentication credentials for the data source:
CREATE USER MAPPING FOR db2admin SERVER remoteXML
  OPTIONS ( ADD REMOTE_AUTHID 'db2admin', ADD REMOTE_PASSWORD '*****');

-- create the local nickname "custtable" for the remote
-- table "customer":
CREATE NICKNAME custtable FOR remoteXML.db2admin.customer;

Figure 5.35 Configuring the federated server for access to a remote data source (continued)

The remote database samplxml contains a table db2admin.customer, and this table is now visible in the local database samplsql under the (nick)name custtable. After these steps, the nickname custtable can be used as if it were a local DB2 table. You can check that the federation setup has worked by issuing an XQuery or SQL/XML query against the nickname, as shown in Figure 5.36.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM custtable
WHERE cid = 1004;

<name>Matt Foreman</name>

1 record(s) selected.

Figure 5.36 Testing federated access to XML data

Any changes to the XML data at the data source are immediately reflected in any queries run against the nickname. You can also perform updates against the nickname to change the remote data at the data source. If your queries against the nickname contain XML predicates, such as XMLEXISTS, these predicates are currently not pushed down for evaluation at the data source. XML predicates in queries against nicknames are evaluated locally at the federated server.

5.10 MANAGING XML DATA WITH HADR

You can use the High Availability Disaster Recovery (HADR) feature in DB2 for Linux, UNIX, and Windows with XML columns in your tables. All relevant XML operations, such as INSERT, UPDATE, and DELETE of XML data, are captured to the DB2 log and shipped to the standby database.
5.11 HANDLING XML DATA IN DB2LOOK AND DB2MOVE

This section discusses the DB2 for Linux, UNIX, and Windows utilities db2look and db2move. The db2look utility can be used to connect to an existing database and produce a script of Data Definition Language (DDL) statements to recreate all objects in the database. The utility collects the definitions of all tables, indexes, table spaces, and so on and writes them to a file, which allows you (or IBM support) to produce an empty copy of the database. This copy can be helpful for troubleshooting or testing purposes. The following is an example of a typical invocation of the db2look utility at the operating system prompt to collect object definitions from the sampxml database:

db2look -d sampxml -e -l -o db2look.txt

The -e and -l parameters specify that the DDL statements for tables, views, indexes, buffer pools, table spaces, and so on are to be extracted into the file specified by the -o parameter. The output file db2look.txt automatically includes information for all XML columns and XML indexes, but does not contain any information about XML Schemas that may exist in DB2’s XML Schema Repository. You must specify the -xs option to also obtain XML Schema information:

db2look -d sampxml -e -l -xs -o db2look.txt -xdir c:\xml

The -xs option exports all files necessary to register XML Schemas and DTDs in a new database, and generates appropriate commands for registering them. If you want to export XML Schemas to a location other than the current directory, use the -xdir option to specify a different directory. This directory must exist before the command is run. The files that are written to this directory start with "doc_" or "md_". The "doc_" files contain the actual XML Schema documents. The "md_" files contain optional metadata about the XML Schemas.
If you examine the output file db2look.txt you find that the "doc_" files are used in the REGISTER XMLSCHEMA commands and the "md_" files in the COMPLETE XMLSCHEMA commands (see Figure 5.37).

-- DDL Statements for XSR object "DB2ADMIN"."CUSTOMER"
REGISTER XMLSCHEMA "http://posample.org"
  FROM c:\xml\xml\doc_562949953421312
  AS "DB2ADMIN"."CUSTOMER";
COMPLETE XMLSCHEMA "DB2ADMIN"."CUSTOMER"
  WITH c:\xml\xml\md_281474976710656;

Figure 5.37 Fragment of output from the db2look command
The db2move utility is also fully aware of XML data. The EXPORT option of the utility automatically includes XML data. The following command writes the contents of the database sampxml to files:

db2move sampxml EXPORT

This command creates a set of output files for each table in the database. For the customer table, the following files are produced in the local directory:

• tab3.ixf (contains the relational data)
• tab3a.001.xml (contains the XML data)
• tab3.msg (contains any messages produced by the EXPORT utility)

5.12 SUMMARY

All of DB2’s data movement utilities and features support XML data and XML columns in DB2 tables. In particular, this support includes import, export, load, unload, replication, federation, and high availability disaster recovery (HADR). The handling of XML data in the import, export, load, and unload utilities is very similar to the handling of LOBs.

A common requirement is to split a large XML document into smaller documents, and this can be achieved with the XMLTABLE function. Just remember that whenever you take a fragment of an XML document and try to insert it into an XML column as an individual document, a new XML document node needs to be added.

This chapter has provided you with the tools and skills to move XML data into and out of a DB2 database. The next four chapters deal with querying XML data that is stored in XML columns in the database.
CHAPTER 6 Querying XML Data: Introduction and XPath

This is the first of four chapters that discuss methods for querying XML data. This chapter lays the foundation for the next three chapters and discusses the following topics:

• Overview of the different options for querying XML data (section 6.1)
• The XPath and XQuery data model (section 6.2)
• The XPath language (sections 6.3 through 6.15)

XPath provides the basic means for traversing XML documents, evaluating predicates, and retrieving XML values. XPath is at the very core of querying XML data in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. A good understanding of XPath and its data model is essential for querying XML data in DB2. The subsequent chapters then expand on these fundamental concepts as follows:

• Chapter 7, Querying XML Data with SQL/XML, describes SQL/XML and how to embed XPath in SQL statements.
• Chapter 8, Querying XML Data with XQuery, covers the XQuery language, which is a superset of XPath.
• Chapter 9, Querying XML Data: Advanced Queries and Troubleshooting, covers advanced XML queries with joins, aggregation, and case-insensitive predicates. It also discusses common errors and guidelines for avoiding “bad” queries.

This book does not provide a formal and complete XPath and XQuery language reference that covers all functions and features. We explain the most commonly used language features and how they are supported in DB2. We focus on the practical use of these language features, not on their
formal definition. Appendix C, Further Reading, contains pointers to further reading about XPath and XQuery. To ease the introduction of the XML query languages, we defer the discussion of XML namespaces to Chapter 15, Managing XML Data with Namespaces.

6.1 AN OVERVIEW OF QUERYING XML DATA

The basic language to query XML data is XPath, which is a subset of XQuery. XQuery adds additional expressions and language constructs to XPath and supports more advanced queries. XQuery and XPath have been standardized by the World Wide Web Consortium (W3C). Furthermore, the SQL:2006 standard includes functions that allow you to embed XQuery or XPath in SQL statements.

Figure 6.1 shows the relationships between XPath, SQL/XML, and XQuery. Both XPath and XQuery are based on the same data model, which is called the XQuery 1.0 and XPath 2.0 Data Model. This data model defines how XML data is represented so that XPath and XQuery can operate on it. Section 3.1, Understanding XML Document Trees, already described how the data model defines the tree representation of XML documents. Queries expressed in XQuery or XPath typically traverse these XML document trees, evaluate predicates, and retrieve XML values.

[Figure 6.1 contains the following labels]
SQL/XML: ISO/IEC 9075-14:SQL/XML
XQuery 1.0: http://www.w3.org/TR/xquery
XPath 2.0: http://www.w3.org/TR/xpath20
Expressions
Functions & Operators: http://www.w3.org/TR/xquery-operators/
XQuery 1.0 and XPath 2.0 Data Model: http://www.w3.org/TR/query-datamodel/

Figure 6.1 Relationship between XQuery, XPath, and SQL/XML

With the languages shown in Figure 6.1 you can query your XML in any of the following five ways:

• Plain SQL: allows full-document retrieval (see Chapter 4)
• SQL/XML: XPath embedded in SQL (see Chapters 7 and 9)
• SQL/XML: XQuery embedded in SQL (see Chapters 8 and 9)
• XQuery as a stand-alone language (see section 6.5 and Chapters 8 and 9)
• SQL embedded in XQuery (see Chapters 8 and 9, sections 8.8, 8.9, and 9.2)

DB2 9 for z/OS does not support XQuery, which means that options 3, 4, and 5 (the last three in the list above) are only available in DB2 for Linux, UNIX, and Windows. This is not as big a limitation as it might seem at first sight. SQL/XML with embedded XPath expressions (option 2, the second in the list) is a very powerful combination and sufficient for a very wide range of applications. Section 8.3, Comparing FLWOR Expressions, XPath Expressions, and SQL/XML, shows that many queries in XQuery notation can also be expressed in SQL/XML with XPath.

Plain SQL without any XQuery or XPath is really only useful for full-document retrieval and operations such as insert, delete, and update of whole documents. Selection of documents must be based on non-XML columns in the same table.

XPath embedded in SQL statements provides very broad functionality. You can express predicates on XML columns, extract document fragments, pass parameter markers to XML predicates, use full-text search, and perform efficient aggregation and grouping. This approach also allows you to join XML columns or to combine and join XML with relational data. Most applications are well served by this approach.

XQuery embedded in SQL statements offers the broadest functionality due to the increased richness of XQuery over XPath. For example, XQuery provides advanced concepts such as direct XML element constructors, conditional expressions, or nested iterations over XML nodes. If you use DB2 for z/OS, note that XQuery features that are not available in XPath can often be compensated for by using SQL features. For example, you can often use a SQL CASE expression to achieve the same result as the XQuery if-then-else expression.
Also, the SQL/XML publishing functions can construct XML data much like the direct element constructors in XQuery.

XQuery as a stand-alone language is a good option if your applications require querying and manipulating of XML data only, and do not involve any relational data. Also, if you are migrating from an XML-only database to DB2 and already have an existing XQuery workload, you might prefer to stick with plain XQuery.

XQuery with embedded SQL can be a good choice if you want to leverage relational predicates and indexes as well as full-text search to pre-filter the documents from an XML column that are then input to an XQuery. SQL embedded in XQuery also allows you to run external functions and UDFs on the XML data. However, queries with grouping, aggregations, and parameter markers are typically better done in SQL/XML.

No matter what combination of SQL and XQuery you choose in one statement, DB2 uses a single query compiler to produce and optimize a single execution plan for the entire query.

Table 6.1 summarizes the respective advantages of the different options for querying XML data in DB2.
In this table “–” indicates that the given approach does not support a feature, “+” means that the feature is supported but that a more efficient or convenient way might exist, and “++” signifies the feature is very well supported.

Table 6.1 Characteristics of XML Query Options in DB2

Feature                                  Plain SQL   SQL/XML with XPath or XQuery   Plain XQuery   XQuery with Embedded SQL
XML predicates                           –           ++                             ++             ++
Relational predicates                    ++          ++                             –              +
Parameter markers for XML predicates     –           ++                             –              –
Joining XML and relational               –           ++                             –              ++
Joining XML with XML                     –           ++                             ++             ++
Suitability for XML-only applications    –           +                              ++             +
Insert, update, delete                   ++          ++                             –              –
Transforming XML data                    –           +                              ++             ++
Full-text search                         +           ++                             –              ++
Aggregation and grouping                 –           ++                             +              +
User-defined functions                   ++          ++                             –              ++

6.2 UNDERSTANDING THE XQUERY AND XPATH DATA MODEL

The XQuery and XPath Data Model (commonly known by its short name of XQuery Data Model) defines how XML data is represented so that XQuery and XPath queries can operate on it in a consistent and well-defined manner. The more complex your XML queries and updates become, the more you will find that a good understanding of the XQuery data model is beneficial. It helps you with the following tasks:

• Writing correct XML queries and updates
• Understanding the behavior of complex queries and expressions
• Debugging and correcting XML queries

6.2.1 Sequences

At the core of the XQuery data model is the definition of permissible values, which are also called instances of the data model. Roughly speaking, instances of the XQuery data model include XML documents as well as document fragments, individual elements, attributes, and
atomic values. A more precise definition follows shortly. A fundamental concept to remember is that an XQuery always takes one instance of the data model as input, and produces another instance of the data model as output.

Every instance of the XQuery data model is a sequence. A sequence is an ordered collection of zero, one, or multiple items. An item is either an atomic value or a node. Atomic values include strings, dates, integers, decimals, double precision numbers, and so on, as defined by the XML Schema specification. Their types are xs:string, xs:date, xs:integer, xs:decimal, xs:double, and so on. Examples of atomic values are the following:

• 100 is an atomic value of type xs:integer.
• 1.5 and 3.145634785348 are atomic values of type xs:decimal.
• 1E8 is an atomic value of type xs:double.
• The strings “Peter” and “this is a cat” are atomic values of type xs:string.
• The expression xs:date("2009-05-09") converts a string into an atomic value of type xs:date.

A node is either a document node, an element node, an attribute node, a text node, a comment node, or a processing instruction node. An element node represents an XML element, an attribute node represents an XML attribute, and so on. Element nodes can have children (child nodes) to form hierarchies of nodes. An XML document is such a node hierarchy where the topmost node is a document node. See section 3.1, Understanding XML Document Trees, and Figure 3.2, for further details.

In the following list of examples, sequences are written as comma-separated lists of items, with the whole list enclosed in parentheses and strings enclosed in double quotes. This is actually how sequences can be constructed in XQuery.
• This sequence consists of four atomic values:
  (100, "Peter", "This is a cat", 1.5)
• This is the empty sequence, which contains zero items:
  ()
• This sequence contains one item, which is an atomic value:
  ("555 Bailey Avenue")
  In the XQuery Data Model there is no difference between a single value and a sequence of length 1 that contains that single value.
• Since sequences are ordered, the sequences (6, "F", 3) and (3, 6, "F") are different from each other.
• Sequences are never nested. Concatenating the sequences (1,2,3) and (4,5) produces the sequence (1,2,3,4,5), and not (1,2,3,(4,5)) or ((1,2,3),(4,5)). A sequence is never an item in another sequence.
• The following is a sequence of three XML element nodes:
  (<a></a>, <b>34</b>, <c><d>John</d></c>)
  The first element (a) is empty. The second element (b) has a child node, which is a text node with the atomic value 34. The third element (c) has a child node (d), which in turn contains a text node with the value “John”.
• This next sequence contains one item, which is a node that has two child nodes. The two child nodes are an attribute node (id) and another element node (name):
  (<customer id="123"><name>John Doe</name></customer>)
• If you have a table T with an XML column XMLCOL, the XML documents in the column XMLCOL can form a sequence. You see later in this chapter that the function db2-fn:xmlcolumn("T.XMLCOL") produces exactly that sequence of documents (section 6.5).
• Sequences can contain a mix of nodes and atomic items:
  (167, <p>This is a cat</p>, "Peter", <name><first>Peter</first></name>, 1.53E9 )

6.2.2 Sequence in, Sequence out

An XQuery always takes one sequence as input¹, and produces another sequence as output. Let’s look at some examples, before we introduce XPath and XQuery more formally in the next sections.

In Figure 6.2, the input is a sequence that contains a single XML document. The output is a sequence that contains a single text node; that is, the text value of the name element.

Input:  (<customer id="123"><name>John Doe</name></customer>)
XQuery: /customer/name/text()
Output: (John Doe)

Figure 6.2 XPath that returns a text node

In Figure 6.3, the input is a sequence of four XML documents. The XQuery is a path expression that returns the text values of <b> elements that are child nodes of <a> elements. The result is a sequence of three text nodes.
Note that the third document in the input sequence does not contribute anything to the result of the query. This is because its element names do not match the element names in the XQuery:

Footnote 1. You will see later that advanced XQuery expressions can even take multiple sequences as input. A join between two columns is a typical example. For now, let’s keep things simple and assume a single input sequence.
Input:  (<a><b>15</b></a>, <a><b>27</b></a>, <c><d>19</d></c>, <a><b>Peter</b></a>)
XQuery: /a/b/text()
Output: (15, 27, Peter)

Figure 6.3 XPath that returns a sequence of multiple text nodes

The same input is used in Figure 6.4, but the XQuery returns a sequence of element nodes.

Input:  (<a><b>15</b></a>, <a><b>27</b></a>, <c><d>19</d></c>, <a><b>Peter</b></a>)
XQuery: /a/b
Output: (<b>15</b>, <b>27</b>, <b>Peter</b>)

Figure 6.4 XPath that returns a sequence of multiple elements

The query in Figure 6.5 looks for <cstomer> elements. This may be intended or due to a misspelled tag name in the query. Either way, such elements are not found and so the empty sequence is returned:

Input:  (<customer id="123"><name>John Doe</name></customer>)
XQuery: /cstomer/name
Output: ()

Figure 6.5 Misspelled element in an XPath returns an empty sequence

In Figure 6.6 the input is a sequence with a single item, which is a well-formed XML document. The output is a sequence of three atomic values:

Input:  (<a> <b>15</b> <b>27</b> <b>Peter</b> </a>)
XQuery: /a/data(b)
Output: (15, 27, Peter)

Figure 6.6 XPath that returns a sequence of atomic values

The difference between data(), which produces atomic values, and text(), which produces text nodes in Figure 6.3, is explained in more detail in the next section. Now that you have an understanding of the XQuery data model, let’s properly introduce the XQuery language. We start with XPath, a subset of XQuery.

6.3 SAMPLE DATA FOR XPATH, SQL/XML, AND XQUERY

We use the two XML documents in Figure 6.7 as sample data to illustrate the concepts of XPath in the remainder of this chapter. These documents also serve as the sample data for most of the SQL/XML and XQuery examples in Chapters 7 and 8.
To be precise, the input for the XPath queries in this chapter is a sequence of two items, which are the two documents in Figure 6.7. Remember that XPath is a subset of XQuery, so every XPath takes a sequence of items as input and produces another sequence as output.

<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr country="Canada">
    <street>845 Kean Street</street>
    <city>Aurora</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N8X 7F8</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
  <phone type="home">416-555-2937</phone>
  <phone type="cell">905-555-8743</phone>
</customerinfo>

<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M3Z 5H9</pcode-zip>
  </addr>
  <phone type="work">905-555-4789</phone>
  <phone type="home">416-555-3376</phone>
  <assistant>
    <name>Gopher Runner</name>
    <phone type="home">416-555-3426</phone>
  </assistant>
</customerinfo>

Figure 6.7 Two sample documents to demonstrate XPath navigation

In the following sections we explain XPath through a series of examples. Each example consists of an XPath expression and its result, which is based on the input data in Figure 6.7. If the result is a sequence of more than one item, each item is placed in a separate row. We first explain simple XPath in generic terms, then how to run XPath in DB2, and finally additional XPath features such as wildcards, predicates, logical expressions and so on. Unless otherwise noted, all XPath, XQuery, and SQL/XML features (such as predicates, arithmetic, casting, built-in functions, and so on) work the same way for elements and attributes.

6.4 INTRODUCTION TO XPATH

XPath provides the basic means for traversing XML documents, evaluating predicates, and retrieving XML values. XPath is the bread and butter for querying XML data in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
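The "sequence in, sequence out" behavior can be imitated in a few lines of Python. This is a rough sketch rather than DB2 functionality: it assumes the standard-library module xml.etree.ElementTree, abbreviated copies of the two sample documents, and a hypothetical helper select() that applies a test on the root element followed by a child path to every document in the input list, and returns one flat list of matching nodes.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two sample documents (an input sequence of two items).
docs = [
    ET.fromstring(
        '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
        '<phone type="work">905-555-7258</phone></customerinfo>'
    ),
    ET.fromstring(
        '<customerinfo Cid="1004"><name>Matt Foreman</name>'
        '<assistant><name>Gopher Runner</name></assistant></customerinfo>'
    ),
]


def select(sequence, root_name, child_path):
    """Hypothetical helper: evaluate /root_name/child_path over a list of documents."""
    result = []
    for root in sequence:
        if root.tag == root_name:  # node test on the root element
            result.extend(root.findall(child_path))  # one flat result, never nested
    return result


# Comparable to /customerinfo/name: only names that are children of the root.
names = [n.text for n in select(docs, "customerinfo", "name")]
print(names)

# A root element name that does not occur yields the empty sequence.
print(select(docs, "customer", "name"))
```

The name of the assistant in the second document is not part of the result, because it is not an immediate child of customerinfo.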
6.4.1 Analogy Between XPath and Navigating a File System

The fundamental concept of using paths to navigate hierarchical structures is well known. Just think of a file system on your personal computer or any Linux or UNIX machine. Most file systems have a root directory, which has subdirectories, which in turn have other subdirectories, and so on. This defines a hierarchy of directories or folders. In a Windows file system the path C:\WINDOWS\system32\drivers points to a specific folder in that hierarchy. A path in a UNIX file system might be /home/sqllib/samples/xml. The path consists of multiple steps that are delimited by the / character. The command cd /home/sqllib/samples/xml takes you to this directory, which then becomes your current directory. We can also call it the current context. From there you can use the command cd ../java/jdbc to navigate to the directory /home/sqllib/samples/java/jdbc. The .. takes you to the parent of the current directory, and from there the navigation continues down into the java/jdbc directory.

In the examples in the following sections you will find that using XPath to navigate the trees formed by the nested elements and attributes of XML documents works in a strikingly similar manner. Just like a path in a file system, an XPath expression consists of several steps that are separated by the slash (/) character. Each step typically navigates to another level of the hierarchy. XPath also has the notion of a current context and also uses two dots (..) to indicate parent step navigation. Unlike directories in a file system, XML elements can have multiple child elements with the same name.

6.4.2 Simple XPath Queries

The XPath expression in Figure 6.8 starts its navigation at the root element customerinfo of every XML document. From there it navigates to the name element, which is an immediate child of customerinfo.
The nodes identified by the last step of the XPath expression are considered the result (or value) of the expression. Hence, the name elements that are children of customerinfo elements are returned from each document in the input sequence. Note that the second input document also contains a name element that is not returned because it is a child of the assistant element, which does not match the XPath expression in this example. The element names customerinfo and name in this path expression effectively serve as so-called node tests. This means that at each step the path expression only considers nodes that match the given element or attribute name.

XPath:  /customerinfo/name
Output: <name>Robert Shoemaker</name>
        <name>Matt Foreman</name>

Figure 6.8 XPath that returns name elements

If the first step of the XPath in Figure 6.8 was customer instead of customerinfo, as shown in Figure 6.9, no nodes would be identified or returned. This is because the input data does not
contain a customer element at the beginning of any document. Consequently, there can be no name element that is a child of customer. When an XPath expression returns an empty result unexpectedly, a common reason is that tag names in the path are misspelled. Note that tag names are case-sensitive, which means that /Customerinfo/Name would also return an empty result for the sample data.

XPath:  /customer/name
Output:

Figure 6.9 Incorrect element name in an XPath returns an empty sequence

If you want to return the customer names without the XML tags <name> and </name>, you need to explicitly navigate to the text node under the name element (see Figure 6.10). This query returns a sequence of two text nodes. Although text() looks like a function, it is not and never takes an argument. It is actually a node test that selects text nodes.

XPath:  /customerinfo/name/text()
Output: Robert Shoemaker
        Matt Foreman

Figure 6.10 XPath that returns text nodes of name elements

The same effect as in Figure 6.10 can be achieved with the string() function or the data() function, as illustrated in Figure 6.11.

XPath:  /customerinfo/data(name)
Output: Robert Shoemaker
        Matt Foreman

Figure 6.11 XPath that returns the atomic values of name elements

The data() function computes the value of its argument and returns atomic values rather than text nodes. In many cases, such as in Figure 6.10 and Figure 6.11, this makes no difference to your application. The real difference between text() and data() or string() becomes apparent when applied to non-leaf elements, such as addr. A non-leaf element is one that contains one or more child elements. The query in Figure 6.12 returns the empty sequence because there are no text nodes that are immediate children of addr. The XPath in Figure 6.13, however, returns the string value of each addr element.
The string value of an element is defined as the concatenation of all text nodes that appear in the subtree under the element. The concatenation does not insert spaces or other delimiters.

XPath:  /customerinfo/addr/text()
Output:

Figure 6.12 Trying to retrieve text nodes from an element without text nodes
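The difference between a text node and a string value can also be tried outside DB2. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module rather than DB2's XPath engine. The sample document is an assumption: it is reconstructed from the outputs shown in this section and written without whitespace between tags, so that no extra whitespace-only text nodes appear.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, reconstructed from the figures in this section.
doc = ET.fromstring(
    '<customerinfo><name>Robert Shoemaker</name>'
    '<addr country="Canada"><street>845 Kean Street</street>'
    '<city>Aurora</city><prov-state>Ontario</prov-state>'
    '<pcode-zip>N8X 7F8</pcode-zip></addr></customerinfo>'
)

name = doc.find("name")
addr = doc.find("addr")

# Leaf element: its single text node and its string value are the same.
name_text = name.text
name_string = "".join(name.itertext())

# Non-leaf element: there is no text directly under addr ...
addr_text = addr.text
# ... but the string value concatenates all descendant text, without delimiters.
addr_string = "".join(addr.itertext())

print(repr(name_text), repr(name_string))
print(repr(addr_text), repr(addr_string))
```

In this sketch, addr.text plays the role of addr/text() and the joined itertext() result plays the role of string(addr). The mapping is only approximate, because ElementTree does not model text nodes as separate objects.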
XPath:  /customerinfo/string(addr)
Output: 845 Kean StreetAuroraOntarioN8X 7F8
        1596 BaselineTorontoOntarioM3Z 5H9

Figure 6.13 Obtaining the string value of a non-leaf element

The query in Figure 6.13 can also use the function data() instead of string(). The differences between string() and data() are subtle and not always relevant. For example, data() can take a sequence of multiple items as input but string() cannot. If the input argument is an empty sequence, data() returns an empty sequence but string() returns a string of length zero.

If you let an XPath expression point to a non-leaf node without using a function such as text(), data(), or string(), the node is returned together with all the child nodes that it contains. This is shown in Figure 6.14 where the XPath returns a sequence of two addr nodes, both of which contain other nodes. Effectively, this allows you to extract entire fragments (subtrees) from your XML documents.

XPath:  /customerinfo/addr
Output: <addr country="Canada">
          <street>845 Kean Street</street>
          <city>Aurora</city>
          <prov-state>Ontario</prov-state>
          <pcode-zip>N8X 7F8</pcode-zip>
        </addr>
        <addr country="Canada">
          <street>1596 Baseline</street>
          <city>Toronto</city>
          <prov-state>Ontario</prov-state>
          <pcode-zip>M3Z 5H9</pcode-zip>
        </addr>

Figure 6.14 XPath that returns document fragments

Besides the four child elements street, city, prov-state, and pcode-zip, the addr element also has an attribute, country. The XPath language requires the use of the @ sign to distinguish attributes from elements in path expressions. The XPath in Figure 6.15 returns the values of the country attributes. The function data() or string() is required in order to return attribute values in a query result.

XPath:  /customerinfo/addr/data(@country)
Output: Canada
        Canada

Figure 6.15 Using the data() function to return attribute values
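To see the distinction between a returned fragment and a returned attribute value in a small program, consider the following sketch. It uses Python's standard xml.etree.ElementTree module, not DB2, and the abbreviated sample document is an assumption based on the figures above.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated sample document (only two of the four addr children).
doc = ET.fromstring(
    '<customerinfo><name>Matt Foreman</name>'
    '<addr country="Canada"><street>1596 Baseline</street>'
    '<city>Toronto</city></addr></customerinfo>'
)

addr = doc.find("addr")

# Selecting a non-leaf element gives access to its whole subtree (a fragment).
fragment = ET.tostring(addr, encoding="unicode")

# The attribute value is available as a plain string, comparable to the
# atomic value that data(@country) returns.
country = addr.get("country")

print(fragment)
print(country)
```

The serialized fragment includes the country attribute together with its owning element, while the attribute value alone is just a string.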
Note that /customerinfo/addr/@country/text() would produce an empty result because unlike elements, attributes never have separate text nodes. While text() produces text nodes, data() and string() produce atomic values. If you omit the data() function in Figure 6.15 and try to use /customerinfo/addr/@country to return the attributes to your application, the XPath fails with an error (SQL16075N). This is because without the data() function it tries to return the complete attribute nodes rather than just their atomic value. But attribute nodes can never be returned on their own. They always have to be within an element. However, their values can be returned. The difference between the attribute node and its value is that a node is a more complex entity that has properties such as a node name, node kind, a value, and possibly a namespace.

The query in Figure 6.16 returns the text nodes of all phone elements that are immediate children of a customerinfo element. The result is a sequence of five items, three from the first and two from the second of the input documents.

XPath:  /customerinfo/phone/text()
Output: 905-555-7258
        416-555-2937
        905-555-8743
        905-555-4789
        416-555-3376

Figure 6.16 Returning multiple text nodes from each document

This example also illustrates that each step in a path expression actually produces a sequence of so-called context nodes, which are input to the next step. The first step of the XPath in Figure 6.16 is /customerinfo. For the input data, this step produces two customerinfo nodes, one from the first and one from the second input document. The next step is /phone, which is executed once for each of the two customerinfo nodes. For the first customerinfo node, the /phone step produces three phone elements and for the second customerinfo node it produces two phone elements. This makes a total of five nodes. The entire sequence of five nodes is input to the last step, /text().
This step is executed once for each of the five nodes and produces their text nodes. Since each of the phone elements has exactly one text node, the final cardinality of the result is five. Figure 6.17 shows the result of a query that returns full phone elements and not just their text nodes.

XPath:  /customerinfo/phone
Output: <phone type="work">905-555-7258</phone>
        <phone type="home">416-555-2937</phone>
        <phone type="cell">905-555-8743</phone>
        <phone type="work">905-555-4789</phone>
        <phone type="home">416-555-3376</phone>

Figure 6.17 Returning multiple elements from each document
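The step-by-step evaluation described for Figure 6.16 can be mimicked with a few list comprehensions. This is a sketch in Python with the standard xml.etree.ElementTree module, not a description of how DB2 evaluates paths internally. The two sample documents are assumptions that contain only the phone elements shown in the figures.

```python
import xml.etree.ElementTree as ET

# Assumed sample documents, reduced to the phone elements from the figures.
docs = [ET.fromstring(x) for x in (
    '<customerinfo><phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>',
    '<customerinfo><phone type="work">905-555-4789</phone>'
    '<phone type="home">416-555-3376</phone></customerinfo>',
)]

# Step 1, /customerinfo: keep the top-level elements named customerinfo.
step1 = [root for root in docs if root.tag == "customerinfo"]

# Step 2, /phone: evaluated once for each context node from step 1.
step2 = [phone for ctx in step1 for phone in ctx.findall("phone")]

# Step 3, /text(): evaluated once for each phone element from step 2.
step3 = [phone.text for phone in step2]

print(len(step1), len(step2), len(step3))
print(step3)
```

Each list is the sequence of context nodes for the following step, which is why the cardinality grows from two to five between the first and second step and then stays at five.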
6.5 HOW TO EXECUTE XPATH IN DB2

All the XPath queries in this chapter can be executed in DB2. Both DB2 for z/OS and DB2 for Linux, UNIX, and Windows allow XPath expressions to be embedded in SQL. This is called SQL/XML and is explained in Chapter 7, Querying XML Data with SQL/XML. DB2 for Linux, UNIX, and Windows additionally supports XQuery as a stand-alone language without any SQL required. All you need is a table with a column of type XML that contains one or multiple XML documents. Let’s assume that you have the following table with an XML column named info, and that it has two rows containing the two documents shown in Figure 6.7.

CREATE TABLE customer(id INTEGER, info XML)

To run XPath queries against the two documents in the table, you can use the same XPath expression as in the previous examples, with two additions. First, any XQuery or XPath query in DB2 for Linux, UNIX, and Windows starts with the keyword xquery to indicate that it’s not an SQL statement. This keyword can be upper- or lowercase. Second, you need to reference the XML column as the starting point (context) for the XPath. For this purpose DB2 offers the function db2-fn:xmlcolumn(), which takes an XML column name as input and produces the sequence of all documents in that column as output. The column name must be qualified by a table or view name, which can optionally be qualified by an SQL schema name:

db2-fn:xmlcolumn('SQLSCHEMA.TABLENAME.XMLCOLUMNNAME')

Using the table and column name as well as the xquery keyword, you can run your first XPath query in the DB2 Command Line Processor (see Figure 6.18). This query simply returns the result of the db2-fn:xmlcolumn() function; that is, the sequence of all XML documents in the info column. Remember that every XQuery takes a sequence of items as input and produces another sequence of items as output. In this simple example, the input and output sequences are identical.
The query returns a single column of type XML and two rows, one row for each of our two sample documents.

xquery db2-fn:xmlcolumn('CUSTOMER.INFO')

Figure 6.18 Executing XPath in DB2 for Linux, UNIX, and Windows

To verify that the query in Figure 6.18 returns a column of type XML you can describe it, just like you would describe any SQL statement to check the number, names, and data types of the columns in the result set (see Figure 6.19). The length of the XML data type is zero, because the XML type is a hierarchical data format and there is no notion of length associated with a tree. In contrast, the length of an INTEGER is 4 bytes, and that of a VARCHAR(100) is 100 bytes.
db2 => describe xquery db2-fn:xmlcolumn('CUSTOMER.INFO')

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 988   XML                       0  INFO                                      4

db2 =>

Figure 6.19 Describing an XQuery

You can run the query in Figure 6.18 in the DB2 Command Line Processor (CLP) or any other interface, such as the Command Editor that’s part of the DB2 Control Center, IBM Data Studio, or, for example, via JDBC from a Java application. When the XML type data is returned from the DB2 server to any such client it is automatically serialized; that is, converted from DB2’s internal tree format to XML text. The CLP displays at most 4,000 bytes of XML text per row. Any XML column values shorter than this are padded with blanks. Any XML data beyond 4,000 bytes per row is truncated in the CLP display. To avoid truncation and to see the full XML output, you can use the DB2 EXPORT utility (see Chapter 5, Moving XML Data) or a tool such as IBM Data Studio.

The table and column name in the db2-fn:xmlcolumn() function must be enclosed in either single quotes or double quotes. They typically also need to be in uppercase. This is because DB2 table and column names default to uppercase, unless you use quotes in the CREATE TABLE statement to force a lowercase table or column name.

Now that you are familiar with the mechanics of running XPath in DB2, let’s run the XPath expression previously shown in Figure 6.17. Simply append the path /customerinfo/phone to the db2-fn:xmlcolumn() function, as shown in Figure 6.20. The result is exactly the same as in Figure 6.17.

db2 => xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/phone

<phone type="work">905-555-7258</phone>
<phone type="home">416-555-2937</phone>
<phone type="cell">905-555-8743</phone>
<phone type="work">905-555-4789</phone>
<phone type="home">416-555-3376</phone>

  5 record(s) selected.

db2 =>

Figure 6.20 Executing the query from Figure 6.17 in the DB2 Command Line Processor
Remember that each step in a path expression produces a sequence of so-called context nodes that are input to the next step. In the same manner, the db2-fn:xmlcolumn() function produces a sequence of XML documents that are input to the first step of the XPath expression. Hence, the XPath /customerinfo/phone is evaluated once for each document in the table. The result items from all documents, in this case phone elements, are combined into a single sequence. Each item is returned to the client as a separate row.

DB2 also offers the function db2-fn:sqlquery(), which is similar to db2-fn:xmlcolumn(). While db2-fn:xmlcolumn() takes an XML column name as input and produces the sequence of all documents in that column as output, the function db2-fn:sqlquery() takes an SQL query as input and produces as output the sequence of documents that are returned by that SQL statement. This SQL query can be any query, even with joins and subselects and so on, as long as it returns a single column of type XML. Figure 6.21 is a simple example of a query that returns a sequence of documents that are a subset of the documents in the XML column info.

xquery db2-fn:sqlquery("SELECT info FROM customer WHERE id > 1003")

Figure 6.21 Producing a sequence of documents with an SQL query

The key difference between db2-fn:xmlcolumn() and db2-fn:sqlquery() is that db2-fn:xmlcolumn() takes all documents in an XML column as the input for your XPath expression, while db2-fn:sqlquery() allows you to use relational predicates and so on to pre-filter the set of documents that are input to the XPath query. The embedded SQL statement is parsed by DB2’s SQL parser, which means that table and column names are automatically converted to uppercase. You can append any path expression to the db2-fn:sqlquery() function to further process the returned documents.
In Figure 6.22, the XPath expression /customerinfo/phone is applied to the one XML document that is identified by the embedded SQL statement.

db2 => xquery db2-fn:sqlquery("select info from customer where id = 1003")/customerinfo/phone

<phone type="work">905-555-7258</phone>
<phone type="home">416-555-2937</phone>
<phone type="cell">905-555-8743</phone>

  3 record(s) selected.

db2 =>

Figure 6.22 Using db2-fn:sqlquery in the DB2 Command Line Processor
You can run any XPath expression that you see in this chapter simply by appending it to the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions and using the xquery keyword, as illustrated in the preceding figure. In the following sections we explain further features of the XPath language and provide more examples. All of them can be run in DB2 for Linux, UNIX, and Windows just like you see in Figure 6.20 and Figure 6.22.

6.6 WILDCARDS AND DOUBLE SLASHES

XPath allows the use of the * as a wildcard character to match any element name, and @* to match any attribute name. The XPath expression in Figure 6.23 uses the wildcard to return all elements that are immediate children of the assistant element. The assistant element occurs only in the second of the two documents and has two child elements, name and phone.

XPath:  /customerinfo/assistant/*
Output: <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>

Figure 6.23 Using a wildcard to select all child elements of assistant

The wildcard in the XPath expression in Figure 6.24 matches all elements that occur directly under customerinfo. These are the elements name, addr, phone and in the second document also assistant. The sequence of these elements is input to the last step of this XPath, /name. In other words, the XPath then tries to find /customerinfo/name/name, /customerinfo/addr/name, /customerinfo/phone/name, and /customerinfo/assistant/name. The first three of these don’t exist and so only the assistant’s name is returned.

XPath:  /customerinfo/*/name
Output: <name>Gopher Runner</name>

Figure 6.24 Using a wildcard to match any child element of customerinfo

The query in Figure 6.25 uses two wildcards, one to match any element at the second level of the document hierarchy and one to match any element at the third level. The first wildcard matches name, addr, phone, and assistant, as in the previous example.
The next wildcard then matches any child elements of these nodes. Only addr and assistant have child elements and all of those are returned. The last two elements in the result, name and phone, are children of assistant, which exists only for one of the two input documents. Customer phone elements are not included in the result, because they are at the second instead of the third level of the document. The XPath expression /*/*/* would return the same result from the sample data.
XPath:  /customerinfo/*/*
Output: <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
        <street>1596 Baseline</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M3Z 5H9</pcode-zip>
        <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>

Figure 6.25 Using wildcards to return any element on the third level of the document

While * matches any element name, @* matches any attribute. The XPath in Figure 6.26 is similar to the one in Figure 6.25, but it returns any attribute at the third level of the documents because it uses @* instead of * in the last step of the path expression. Additionally, the data() function is used to return just the value of each attribute node. The sample data contains two attributes on the third level of the document, /customerinfo/addr/@country and /customerinfo/phone/@type. The addr and phone elements are matched by the * in the second step of the XPath, and their attributes are matched by @* in the third step. Attributes of the assistant phone elements are not returned because they are at the fourth level.

XPath:  /customerinfo/*/data(@*)
Output: Canada
        work
        home
        cell
        Canada
        work
        home

Figure 6.26 Using wildcards to return any attribute on the third level of the document

The examples clarify that a * is a wildcard for a tag name at a very specific level of the XML documents, and you need to use multiple wildcards to match arbitrary tags at multiple levels. Another XPath construct that makes queries more general is the double slash (//). You can use it to reach descendants at any level in a document tree. An example is shown in Figure 6.27. The difference between a single slash (/) and a double slash (//) is that a / navigates exactly one level further down in the document tree while a // navigates any number of levels down the tree.
In other words, a / navigates to an immediate child node while a // navigates to all descendant nodes. Descendant nodes include child nodes, grandchild nodes, great-grandchild nodes, and so on.
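The level-by-level behavior of the wildcard can be tried with a short sketch. It uses Python's standard xml.etree.ElementTree module, which supports the * wildcard but not @* or data(), so the attribute case is emulated in plain Python. The sample document is an assumption, abbreviated from the second document in the figures. Note that ElementTree paths are relative to the root element, so "*/*" here corresponds to /customerinfo/*/*.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated version of the second sample document.
doc = ET.fromstring(
    '<customerinfo><name>Matt Foreman</name>'
    '<addr country="Canada"><street>1596 Baseline</street>'
    '<city>Toronto</city></addr>'
    '<phone type="work">905-555-4789</phone>'
    '<assistant><name>Gopher Runner</name>'
    '<phone type="home">416-555-3426</phone></assistant></customerinfo>'
)

# Corresponds to /customerinfo/*/* : any element at the third level.
third_level = [e.tag for e in doc.findall("*/*")]

# Corresponds to /customerinfo/*/name : name under any second-level element.
names_under_any_child = [e.text for e in doc.findall("*/name")]

# Emulates /customerinfo/*/data(@*) : attribute values of second-level elements.
attr_values = [v for e in doc.findall("*") for v in e.attrib.values()]

print(third_level)
print(names_under_any_child)
print(attr_values)
```

The assistant's phone type (home) does not appear in attr_values because that attribute sits one level deeper than the wildcard step reaches.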
The XPath expression in Figure 6.27 consists of two steps. The first step navigates to the top-level element customerinfo. All customerinfo nodes are input (context) for the second step. The second step, //name, looks for name elements at any level in the document tree under a customerinfo node. It finds two name elements at the second level, /customerinfo/name, and one name element at the third level, /customerinfo/assistant/name.

XPath:  /customerinfo//name
Output: <name>Robert Shoemaker</name>
        <name>Matt Foreman</name>
        <name>Gopher Runner</name>

Figure 6.27 Selecting name elements at any level under customerinfo

Figure 6.27 shows some of the benefits and some of the dangers of the //. A benefit is that the // allows you to easily navigate to all occurrences of a certain element, even if that element occurs at multiple different levels of a document tree. Another benefit can be that it allows you to find a certain element in the documents even if you do not know its exact position and therefore are unable to write a fully qualified XPath.

A danger of the // can be that it might select more data than you actually intended. If the goal of the query in Figure 6.27 was to retrieve customer names only, then the result leads you to believe that there are three customers and that Gopher Runner is one of them. This is incorrect because Gopher Runner is the assistant to Matt Foreman and not a customer himself.

Another disadvantage of the // is that it doesn’t specify a direct path to the desired nodes. This causes an XPath processor, such as DB2, to search exhaustively through potentially large portions of a document. For example, the query in Figure 6.27 requires DB2 to navigate into the addr branch of each document and examine each child element of addr to determine whether its element name is name. A fully specified path without // avoids this overhead and yields better performance.
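The risk of selecting too much with // is easy to reproduce. The following sketch uses Python's standard xml.etree.ElementTree module as a stand-in for DB2. The sample document is an assumption, abbreviated from the second document in the figures, and ".//name" is ElementTree's spelling of a descendant search below the root element.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated version of the second sample document.
doc = ET.fromstring(
    '<customerinfo><name>Matt Foreman</name>'
    '<addr country="Canada"><city>Toronto</city></addr>'
    '<assistant><name>Gopher Runner</name></assistant></customerinfo>'
)

# Corresponds to /customerinfo//name : name elements at any depth.
any_level = [e.text for e in doc.findall(".//name")]

# Corresponds to /customerinfo/name : only immediate children.
direct_children = [e.text for e in doc.findall("name")]

print(any_level)
print(direct_children)
```

The descendant search also returns the assistant's name, whereas the fully specified path returns only the customer name.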
The // can also be used at the beginning of a path expression, such as //name, which for the sample data returns the same result as the query in Figure 6.27. The XPath //* returns all elements from all input documents, because // navigates to any level of the document and * matches any element at each of those levels. Similarly //data(@*) returns all attribute values anywhere in the documents, and //text() returns all text nodes. Use such general expressions with caution.

6.7 XPATH PREDICATES

The preceding XPath examples always return all matching nodes from the input documents. In many cases it is desirable to use search conditions (predicates) to filter the data and only return selected items. In XPath, predicates are always enclosed in square brackets and can appear in any step of the path. In Figure 6.28, a predicate in square brackets is applied to the customerinfo element, which is the first step of the path. Roughly speaking, this query returns the name of the customer(s) whose Cid attribute is 1004. More precisely, the predicate checks for each customerinfo element in the input data, whether the element has an attribute by the name of Cid and whether the value of that attribute is 1004. If such a Cid attribute does not exist or if its value is not 1004, the respective customerinfo element is excluded from further consideration. Based on our input data, only the customerinfo element in the second document passes this test. This element is now the context for the next steps of the navigation, /name/text(), and the value Matt Foreman is returned.

XPath:  /customerinfo[@Cid=1004]/name/text()
Output: Matt Foreman

Figure 6.28 Numeric predicate in an XPath expression

Instead of the equality comparison you can also use less than (<), greater than (>), less than or equal (<=), greater than or equal (>=), and not equal (!=). More details on comparison operators are provided in section 6.8.

In Figure 6.29, the predicate in square brackets is applied to the addr element to return the streets of those customers who live in Toronto. If an addr element has a child element city whose value is Toronto, the addr element is used as the context for the next navigation step, /street.

XPath:  /customerinfo/addr[city="Toronto"]/street
Output: <street>1596 Baseline</street>

Figure 6.29 String predicate in an XPath expression

Remember that the value of an element is defined as the concatenation of all text nodes in the subtree underneath that element (see section 3.1, Understanding XML Document Trees). Since the city element has only a single text node, the predicates [city="Toronto"] and [city/text()="Toronto"] lead to the same result. Hence, in the vast majority of cases you do not need to use /text() in predicates.
The relatively rare case in which it can sometimes be useful to use /text() in predicates is when the immediate children of an element are a mix of element and text nodes. Such elements are said to have mixed content (see section 3.1). If you want to return the city element instead of the street element, a possible XPath is /customerinfo/addr[city="Toronto"]/city. The city element is referenced once to evaluate the predicate and then a second time at the end of the path to return it.
NUMERIC VERSUS STRING COMPARISON

Note that the predicate [@Cid=1004] performs a numeric comparison while the predicate [@Cid="1004"], with double quotes around the literal value, performs a string comparison. The difference between numeric and string comparison can lead to different query results. For example, a string comparison would find that the string values “1E3” and “1000” are not equal. But a numeric comparison would confirm that the numbers 1E3 and 1000 are equal because 1E3 is the exponential notation for 1000. Similarly, the string comparison “2” < “10” is false, but the numeric comparison 2 < 10 is true. Note also that the numeric comparison [@Cid=1004] fails with an error (SQL16061N) at runtime if a document is encountered where the value of the Cid attribute is not a number.

A predicate expression within the square brackets can contain multiple steps to navigate to the element or attribute whose value you want to check. For example, say you want to return the name of all customers in Toronto. To develop this XPath expression from scratch, first start without the predicate and write down just the path to the element that you want to return:

/customerinfo/name

To restrict the result to customers in Toronto, a predicate on the city element is required. The city element is a child of the addr element, which in turn is a child of customerinfo, so this is where you need to apply the predicate:

/customerinfo[addr/city="Toronto"]/name

The predicate [addr/city="Toronto"] checks for each customerinfo element if it has a child element addr that has a child element city whose value is Toronto. The customerinfo nodes that fulfill this condition are then the input for the next step, /name. In other words, the XPath step right after the predicate is /name and it continues navigation based on the element before the predicate (customerinfo) and not based on any element inside the square brackets.
This is illustrated in Figure 6.30, where this XPath expression is shown with two branches. The horizontal branch identifies the items that are to be returned (/customerinfo/name), and the branch in the dotted box is the predicate.

[Diagram: horizontal branch customerinfo, name; predicate branch in a dotted box attached to customerinfo: addr, city = "Toronto"]

Figure 6.30 Visualization of an XPath with a predicate
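The placement of a predicate, and the difference between numeric and string comparison, can be illustrated with a small sketch. It uses Python's standard xml.etree.ElementTree module, which accepts simple predicates such as [city='Toronto'] but not multi-step predicates such as [addr/city="Toronto"], so the latter is emulated with a hypothetical helper function, has_city. The abbreviated sample documents are assumptions. In particular, the Cid value 1003 for the first document is inferred from the examples and is not shown in this section.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated sample documents; Cid="1003" is an inferred value.
docs = [ET.fromstring(x) for x in (
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr><street>845 Kean Street</street><city>Aurora</city></addr>'
    '</customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr><street>1596 Baseline</street><city>Toronto</city></addr>'
    '</customerinfo>',
)]

def has_city(customer, city):
    """Hypothetical helper emulating the predicate [addr/city="..."]."""
    return any(c.text == city for c in customer.findall("addr/city"))

# Emulates /customerinfo[addr/city="Toronto"]/name : the predicate filters
# customerinfo, and navigation to name continues from customerinfo.
names = [d.findtext("name") for d in docs if has_city(d, "Toronto")]

# Corresponds to /customerinfo/addr[city="Toronto"]/street : the predicate
# filters addr, and navigation to street continues from addr.
streets = [s.text for d in docs
           for s in d.findall("addr[city='Toronto']/street")]

# Emulates the numeric predicate [@Cid=1004] by converting before comparing.
by_cid = [d.findtext("name") for d in docs if float(d.get("Cid")) == 1004]

# String versus numeric comparison of the same literals.
string_equal = "1E3" == "1000"
numeric_equal = float("1E3") == float("1000")

print(names, streets, by_cid)
print(string_equal, numeric_equal)
```

As in DB2, the numeric conversion in this sketch raises an error if a Cid value is not a number, which is one reason to choose deliberately between the two kinds of comparison.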
One XPath can contain multiple predicates, as illustrated in Figure 6.31, which returns the street of the customer whose name is Matt Foreman and whose city is Toronto.

XPath:  /customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street
Output: <street>1596 Baseline</street>

Figure 6.31 XPath with two predicates

When writing such a query from scratch, proper placement of the predicates is sometimes not obvious if you are new to XPath. The recommendation is again to first write the XPath without any predicates and only navigate to the element that you want to return (street). This simpler XPath looks like this:

/customerinfo/addr/street

Now you can add filtering predicates for name and city. Since name is a child element of customerinfo, insert a pair of square brackets right after customerinfo for the predicate:

/customerinfo[name="Matt Foreman"]/addr/street

The city element is a child of addr, so the square brackets for the second predicate come right after addr in the path expression, and this completes the query in Figure 6.31:

/customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street

Again, visualizing this query as a branching expression might be helpful (see Figure 6.32).

[Diagram: horizontal branch customerinfo, addr, street; predicate branch name = "Matt Foreman" attached to customerinfo; predicate branch city = "Toronto" attached to addr]

Figure 6.32 Visualization of an XPath with two predicates

Note that a predicate expression in square brackets can contain a / or a // but typically never starts with a / or a //. Consider the following XPath expression as an example:

/customerinfo[/name="Matt Foreman"]/addr/street

This XPath returns the empty sequence because the predicate [/name="Matt Foreman"] does not use the current customerinfo element as context. That is, it does not look for name elements that are children of customerinfo. Instead, the / inside the square brackets causes it to restart navigation at the very top of each document, but there is no document in the sample data where the topmost element is name.

Figure 6.33 shows what can happen if you use // right at the beginning of a predicate expression in square brackets. The intention of this query was to return all cell phones by looking at type attributes anywhere under phone. However, the // inside the square brackets causes it to restart navigation at the very top of each document. Hence, the actual meaning of this query is: Retrieve all phone elements from a document if a type attribute with the value “cell” occurs anywhere in the document. In other words, return all phone elements if one of them is a cell phone.

XPath:  /customerinfo/phone[//@type="cell"]
Output: <phone type="work">905-555-7258</phone>
        <phone type="home">416-555-2937</phone>
        <phone type="cell">905-555-8743</phone>

Figure 6.33 Incorrect use of // in a predicate

If you know that the type attribute is a child of phone, you could simply remove the // from the beginning of the predicate expression. Otherwise you can use a dot to force the // to only search within the subtree (within the current context) of the respective phone element (see Figure 6.34). The current context is explained in more detail in section 6.10.

XPath:  /customerinfo/phone[.//@type="cell"]
Output: <phone type="cell">905-555-8743</phone>

Figure 6.34 Correct use of // in a predicate

Also note that the opening square bracket of a predicate can never follow immediately after a / or a //. For example, the XPath /customerinfo/[name="Matt Foreman"] would fail with an error (SQL16002N). A / starts a new step, which cannot begin with a predicate. A predicate always has to be preceded by a context node (such as an element name) to which it is applied.

And finally, look at Figure 6.35, which uses an equality comparison without square brackets.
This is just a Boolean expression of the form A = B that returns either true or false. It is not a useful predicate to select specific parts of the customer data. In particular, this query does not return the customer whose name is Matt Foreman. The query examines a sequence of name elements and returns true if at least one of them is equal to Matt Foreman. This is called existential semantics and is explained in the next section.

XPath:  /customerinfo/name="Matt Foreman"
Output: true

Figure 6.35 A Boolean expression, not a filtering predicate
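The pitfalls from Figure 6.33 through Figure 6.35 come down to which node a condition is evaluated against. The following sketch emulates them in Python with the standard xml.etree.ElementTree module. It is an illustration only: the helper has_cell is hypothetical, and the sample documents are assumptions reduced to the name and phone elements from the figures.

```python
import xml.etree.ElementTree as ET

# Assumed sample documents, reduced to name and phone elements.
docs = [ET.fromstring(x) for x in (
    '<customerinfo><name>Robert Shoemaker</name>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>',
    '<customerinfo><name>Matt Foreman</name>'
    '<phone type="work">905-555-4789</phone>'
    '<phone type="home">416-555-3376</phone></customerinfo>',
)]

def has_cell(node):
    """Hypothetical helper: is there a type="cell" attribute on node or below?"""
    return any(e.get("type") == "cell" for e in node.iter())

# Emulates phone[//@type="cell"] : the condition looks at the whole document,
# so every phone of a document qualifies if any phone in it is a cell phone.
whole_document = [p.text for d in docs
                  for p in d.findall("phone") if has_cell(d)]

# Emulates phone[.//@type="cell"] : the condition looks only at each phone.
own_subtree = [p.text for d in docs
               for p in d.findall("phone") if has_cell(p)]

# Emulates /customerinfo/name="Matt Foreman" : a single Boolean, not a filter.
is_any_matt = any(n.text == "Matt Foreman"
                  for d in docs for n in d.findall("name"))

print(whole_document)
print(own_subtree)
print(is_any_matt)
```

The first list contains all three phones of the first document, the second contains only the cell phone, and the last expression yields one Boolean for the entire input.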
6.8 EXISTENTIAL SEMANTICS

When you use XPath, existential semantics (also known as existential quantification) is applied automatically all the time. Roughly speaking, existential semantics means that the existence of at least one matching node is sufficient for a predicate to evaluate to true. Let’s look at the query in Figure 6.36 as an example. This query returns the name of those customers whose phone number is 416-555-2937. But both of the input documents contain several occurrences of the phone element. Existential semantics means that the query in Figure 6.36 returns name elements that are children of customerinfo elements that contain at least one child element phone whose value is 416-555-2937. The existence of at least one matching phone element is sufficient to fulfill the predicate. Existential semantics is a useful concept for querying XML data, because it defines how to evaluate predicates on repeating elements (or more generally, on sequences of two or more items).

XPath:  /customerinfo[phone="416-555-2937"]/name
Output: <name>Robert Shoemaker</name>

Figure 6.36 At least one phone element must match, not all of them

Figure 6.37 shows another example of existential semantics. It includes a predicate that contains nothing but the element name assistant. The predicate evaluates to true if this element exists at the indicated position in the document tree; that is, as a child of the customerinfo element. As a result, this query returns the name of those customers who have an assistant, no matter what the assistant name or phone number is. The mere existence of an assistant element is what this predicate is looking for. Such a predicate is called a structural predicate as opposed to a value predicate, which performs a value comparison.

XPath:  /customerinfo[assistant]/name
Output: <name>Matt Foreman</name>

Figure 6.37 A structural predicate

Similarly you can check for the existence of an attribute.
The query in Figure 6.38 retrieves the names of all customers who have a country attribute in the addr element.

XPath:  /customerinfo[addr/@country]/name
Output: <name>Robert Shoemaker</name>
        <name>Matt Foreman</name>

Figure 6.38 Return the name if a country attribute exists

Yet another example of existential semantics is illustrated in Figure 6.39 where the right side of the predicate is a sequence of two atomic values. This predicate is true if there is at least one value in this sequence that is equal to the value of the city element. If you are familiar with IN-list queries in SQL, this is how you can do the same in XPath.
XPath:  /customerinfo[addr/city = ("Toronto","Aurora")]/name
Output: <name>Robert Shoemaker</name>
        <name>Matt Foreman</name>

Figure 6.39 Predicate is true if at least one of the values matches

What if a customer has several addresses so that addr/city evaluates to a sequence of multiple city elements? In this case, existential semantics defines that the predicate is true if at least one of these city elements is equal to at least one of the values on the right side.

Let’s look at the two sequences (1,2,3,4) and (7,8,2). The comparison (1,2,3,4) = (7,8,2) evaluates to true because there is at least one item in the first sequence that is equal to at least one item in the second sequence. This item is the number 2. What might seem counterintuitive at first is that the predicate (1,2,3,4) != (7,8,2) also evaluates to true! This is again due to existential semantics, because there is at least one item in the first sequence that is not equal to at least one item in the second sequence.

Figure 6.40 shows the corresponding behavior for the sample data. Remember that Robert Shoemaker lives in Aurora and Matt Foreman lives in Toronto (see Figure 6.7). The XPath in Figure 6.40 returns Robert Shoemaker’s name because his city (Aurora) is not equal to at least one item in the sequence on the right (Toronto). The same applies to Matt Foreman whose city (Toronto) is not equal to Aurora.

XPath:  /customerinfo[addr/city != ("Toronto","Aurora")]/name
Output: <name>Robert Shoemaker</name>
        <name>Matt Foreman</name>

Figure 6.40 Predicate is true if at least one of the values does not match

The lesson here is that XPath’s existential semantics is not only applied to equality predicates but also to range and inequality predicates for which the behavior is not immediately intuitive if the left side or the right side evaluates to a sequence of more than one item.
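Existential comparison of two sequences can be written down in a few lines. The following Python sketch models the = and != operators as "at least one pair matches". The function names general_eq and general_ne are hypothetical and chosen for this illustration. The sketch ignores XPath's type conversion rules and compares plain Python values.

```python
def general_eq(left, right):
    """True if at least one item on the left equals at least one on the right."""
    return any(a == b for a in left for b in right)

def general_ne(left, right):
    """True if at least one item on the left differs from at least one on the right."""
    return any(a != b for a in left for b in right)

# The two sequences from the text: both comparisons are true.
eq_result = general_eq((1, 2, 3, 4), (7, 8, 2))
ne_result = general_ne((1, 2, 3, 4), (7, 8, 2))

# The sample data: each customer has a single city.
aurora_ne = general_ne(["Aurora"], ["Toronto", "Aurora"])
toronto_ne = general_ne(["Toronto"], ["Toronto", "Aurora"])

# With one item on each side, != behaves as most readers expect.
single_ne = general_ne(["Toronto"], ["Toronto"])

print(eq_result, ne_result, aurora_ne, toronto_ne, single_ne)
```

Because both functions quantify over pairs, general_ne is not the negation of general_eq as soon as either side holds more than one item.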
By contrast, the predicate in Figure 6.41 only involves sequences of exactly one item on either side of the != operator. The behavior is intuitive and only Robert Shoemaker’s name is returned because he is the only customer in our sample who does not live in Toronto.

XPath:
/customerinfo[addr/city != "Toronto"]/name
Output:
<name>Robert Shoemaker</name>
Figure 6.41 Not-equal predicate on single items

6.9 LOGICAL EXPRESSIONS WITH AND, OR, NOT()

Similarly to SQL, XPath allows you to build more complex predicates with and, or, and not(). While and and or are logical operators, not() is a function that reverses the Boolean value of its argument. XPath and XQuery are case-sensitive languages and all operators and functions have to be written in lowercase.
The query in Figure 6.42 uses the or operator to check whether there is an addr with a city element that has the value Toronto, or if there is an addr with a city element whose value is Aurora. For the sample data, this returns the same result as in Figure 6.39. Note that when we say “if there is” or “if there exists” we are hinting at the fact that existential semantics is always at play.

XPath:
/customerinfo[addr/city = "Toronto" or addr/city = "Aurora"]/name
Output:
<name>Robert Shoemaker</name>
<name>Matt Foreman</name>
Figure 6.42 Disjunction of predicates (or-’ing)

The and operator is used in Figure 6.43 to select the names of customers whose city is Aurora and whose country is Canada.

XPath:
/customerinfo[addr/city = "Aurora" and addr/@country = "Canada"]/name
Output:
<name>Robert Shoemaker</name>
Figure 6.43 Conjunction of predicates (and-’ing)

The predicate in Figure 6.43 checks whether there is an addr element with a city child that has the value Aurora, and whether there is also an addr element with a country attribute whose value is Canada. In this case, both conditions are fulfilled by one and the same addr element. In general, however, they could be fulfilled by two different addr elements; for example, if a customer had two addresses. This leads to the next interesting example.

You might write the query in Figure 6.44 to find a customer whose work phone number is 416-555-2937. Such a customer does not exist in our sample data, because 416-555-2937 is Robert Shoemaker’s home phone number, not his work phone number. The predicate restricts the value of the phone element to 416-555-2937, and the type attribute of the phone element to work. Still, the name Robert Shoemaker is returned. This is because existential semantics applies to both parts of the predicate. The first part of the predicate, phone = "416-555-2937", is true because there is a phone element whose value is 416-555-2937. 
The second part of the predicate, phone/@type = "work", is also true because there also is a phone element whose type is work. But, these two phone elements are not the same. The query result in Figure 6.44 is perfectly correct according to the existential semantics of XPath, but probably not what you wanted to achieve with this query.

XPath:
/customerinfo[phone = "416-555-2937" and phone/@type = "work"]/name
Output:
<name>Robert Shoemaker</name>
Figure 6.44 Two predicates matched by different phone elements!
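The effect can be reproduced outside DB2 with a small Python model. This is only a sketch: the list of dictionaries is a hypothetical stand-in for Robert Shoemaker's phone elements (the values are the ones quoted in this chapter), and the two helper functions are illustrative names, not part of any XPath library. The first function tests each condition against any phone element, the second requires one and the same phone element to satisfy both.

```python
# Hypothetical stand-in for Robert Shoemaker's phone elements.
phones = [
    {"type": "work", "value": "905-555-7258"},
    {"type": "home", "value": "416-555-2937"},
    {"type": "cell", "value": "905-555-8743"},
]


def separate_conditions(phones, number, phone_type):
    """Model of [phone = number and phone/@type = phone_type].

    Each side is existential on its own, so the two sides may be
    satisfied by two different phone elements.
    """
    has_number = any(p["value"] == number for p in phones)
    has_type = any(p["type"] == phone_type for p in phones)
    return has_number and has_type


def same_element(phones, number, phone_type):
    """Both conditions must hold for one and the same phone element."""
    return any(p["value"] == number and p["type"] == phone_type for p in phones)


print(separate_conditions(phones, "416-555-2937", "work"))  # True
print(same_element(phones, "416-555-2937", "work"))         # False
```

The first call is true although no work phone has this number, which mirrors the surprising result in Figure 6.44.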
To solve this issue you need to express the predicate such that both conditions are applied to the same phone element. One way of doing this is shown in Figure 6.45 where nested square brackets are used. The outer square brackets describe a predicate that is applied to the customerinfo elements. This predicate says that a customerinfo element should only be considered if a certain phone element exists among its children. The inner square brackets are used to further constrain these phone elements by applying a predicate to them. The inner predicate [text() = "416-555-2937" and @type = "work"] says that the text value of the phone element has to be 416-555-2937 and the type of the same phone element is work. Both parts of this inner predicate are always applied together to the same phone element. Since no such customer exists in our sample data, the correct result of the query is empty.

XPath:
/customerinfo[phone[text() = "416-555-2937" and @type = "work"]]/name
Output:
(empty result)
Figure 6.45 Nested predicates

Figure 6.46 provides another example of the use of the or operator. It returns the names of the customers who have an assistant or a cell phone. Both of the customers are returned because one of them has a cell phone and the other has an assistant.

XPath:
/customerinfo[assistant or phone/@type = "cell"]/name
Output:
<name>Robert Shoemaker</name>
<name>Matt Foreman</name>
Figure 6.46 A structural predicate and a value predicate

The XPath expression in Figure 6.47 lists the names of those customers who don’t have an assistant. The not() function is used in the predicate to qualify the customerinfo elements that do not have a child element with the name assistant. 
XPath:
/customerinfo[not(assistant)]/name
Output:
<name>Robert Shoemaker</name>
Figure 6.47 Checking for the non-existence of an element

Next, let’s look at the following pair of queries (see Figure 6.48 and Figure 6.49) to clarify the difference between using the not() function and the “not equal” comparison operator (!=). Due to existential semantics, the query in Figure 6.48 returns the names of both customers. This is because both of them have at least one phone number that is not equal to 416-555-2937. One such non-matching phone element is enough to fulfill the predicate, even if other phone elements exist that do match this number.
The query in Figure 6.49 returns a result that might be more desirable: the name of the customer who does not have any phone element with the value 416-555-2937. The equality predicate inside the not() function is subject to existential semantics; that is, at least one phone element with this specific number has to exist. The outcome of this test is then negated with the not() function. In other words, the two queries differ because
• The query in Figure 6.48 checks whether there is at least one phone that is not equal to 416-555-2937 (even if other phone elements are equal to this value).
• The query in Figure 6.49 checks whether there is not at least one phone that is equal to 416-555-2937 (that is, there is no phone that is equal to this value).

XPath:
/customerinfo[phone != "416-555-2937"]/name
Output:
<name>Robert Shoemaker</name>
<name>Matt Foreman</name>
Figure 6.48 Predicate is true if at least one phone element does not match

XPath:
/customerinfo[not(phone = "416-555-2937")]/name
Output:
<name>Matt Foreman</name>
Figure 6.49 Predicate is true if none of the phone elements match

6.10 THE CURRENT CONTEXT AND THE PARENT STEP

You probably know that in a file system the dot (.) denotes the current location in the file system, and two dots (..) refer to the parent directory. The same notation exists in XPath to refer to the current node when navigating a document tree, or to the parent of the current node. This is illustrated in Figure 6.50, which shows four versions of an XPath expression. All of them return the same result from our input data; that is, the name element of the customers who live in Aurora. For the discussion of these four XPath expressions you may want to refer to the document tree shown in section 3.1, Understanding XML Document Trees. 
Also, remember that the node name right before the square brackets of a predicate determines the input to the predicate and to the step that immediately follows the predicate. For example, XPath (a) in Figure 6.50 first produces a sequence of customerinfo elements. For each of these customerinfo elements the predicate checks whether there is an addr element that has a child element city whose value is Aurora. If so, the respective customerinfo element is input to the final step, /name, which returns the child element name. XPath (b) is different because the predicate is applied to addr, not to customerinfo. Hence, this XPath first produces a sequence of addr elements, which are input to the predicate. Any addr element that has a child element city with value Aurora is then input to the subsequent
step after the predicate. Since we want to return name elements, we need to navigate from addr to name, which are siblings in our documents. Because an XML document tree has no direct links between siblings, we use the parent step (..) to go one level up in the tree to their common parent, and from there to name.

XPath:
(a) /customerinfo[addr/city = "Aurora"]/name
(b) /customerinfo/addr[city = "Aurora"]/../name
(c) /customerinfo/addr/city[. = "Aurora"]/../../name
(d) /customerinfo/name[../addr/city = "Aurora"]
Output:
<name>Robert Shoemaker</name>
Figure 6.50 Four different ways to write a predicate and return the name element

In XPath (c) the predicate in square brackets is applied to city, which means that this XPath first produces a sequence of city elements, which are used as input (as context nodes) to the predicate. The predicate [. = "Aurora"] uses the dot to refer to the current context, which in this case is always a city element. Any city element for which the predicate is true is then input (context) for the subsequent navigation after the predicate. If you want to return name elements, you need to navigate from city to name, which are in different branches of the document. Hence you need to navigate via the nearest common ancestor, which is customerinfo. Since city is a grandchild of customerinfo, you need to go two levels up in the tree (/../..) before you can reach the name element (/name).

XPath (d) is different from (a), (b), and (c) because there is no /name step after the predicate. Instead, XPath (d) first navigates from customerinfo to name to produce a sequence of name elements. The square brackets are applied to name, to filter the names that get returned. The predicate [../addr/city = "Aurora"] means that a name element is returned only if it has a parent that has a child element addr that has a child element city whose value is Aurora. 
XPath (a) is the most preferable path expression among the four options in Figure 6.50, because it avoids parent steps completely. Avoiding parent steps is good for performance and keeps queries easy to understand. Figure 6.51 shows four more XPath expressions. All of them return empty results because their navigation doesn’t correspond to the structure of the sample data. The parent step in XPath (a) is incorrect for the sample data because it navigates from customerinfo to name with an intermediate parent step as if name was a sibling of customerinfo, which is not the case. XPath (b) tries to return name elements that are children of addr. But, no such name elements exist. Similarly, XPath (c) tries to return name elements that are children of the parent of city (that is, children of addr). Again, no such name elements exist. XPath (d) intends to return name elements that have a child element addr with a city whose value is Aurora. But, this predicate is always false for the sample data because addr is not a child of name.
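If you want to experiment with parent steps outside DB2, the following Python sketch may help. It rests on several assumptions: it uses the standard library module xml.etree.ElementTree, which implements only a limited subset of XPath, so only variants (b) and (c) of Figure 6.50 and Figure 6.51 can be expressed; the paths are written relative to the customerinfo element; and the document is an abridged stand-in for Robert Shoemaker's sample document, built from values quoted in this chapter. The helper name `names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged stand-in for Robert Shoemaker's document (assumption: only
# the elements needed for this illustration are included).
doc = ET.fromstring(
    "<customerinfo>"
    "<name>Robert Shoemaker</name>"
    "<addr country='Canada'>"
    "<street>845 Kean Street</street><city>Aurora</city>"
    "</addr>"
    "</customerinfo>"
)


def names(path):
    """Evaluate a path relative to customerinfo and return the text values."""
    return [elem.text for elem in doc.findall(path)]


# Figure 6.50 (b) and (c): the parent steps lead back to customerinfo.
fig_650_b = names("addr[city='Aurora']/../name")
fig_650_c = names("addr/city[.='Aurora']/../../name")

# Figure 6.51 (b) and (c): name is looked up under addr, where it does not exist.
fig_651_b = names("addr[city='Aurora']/name")
fig_651_c = names("addr/city[.='Aurora']/../name")

print(fig_650_b, fig_650_c)  # ['Robert Shoemaker'] ['Robert Shoemaker']
print(fig_651_b, fig_651_c)  # [] []
```

The empty lists correspond to navigation that does not match the document structure, not to a failed predicate.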
XPath:
(a) /customerinfo[addr/city = "Aurora"]/../name
(b) /customerinfo/addr[city = "Aurora"]/name
(c) /customerinfo/addr/city[. = "Aurora"]/../name
(d) /customerinfo/name[addr/city = "Aurora"]
Output:
(empty result)
Figure 6.51 Four different XPath expressions that don’t match the sample data

6.11 POSITIONAL PREDICATES

So far you have used value predicates and structural predicates. Value predicates compare an element or attribute to a literal value such as a string or a number. Structural predicates don’t look at values but at the structure of an XML document by checking for the existence of an element or attribute by name.

Positional predicates can be used to select nodes based on the order in which they appear in a document or, more generally, in a sequence. As shown in Figure 6.52, a positional predicate is simply an integer number in square brackets. Both documents in the sample data contain multiple phone elements, but this query only returns the first phone element from each document.

XPath:
/customerinfo/phone[1]
Output:
<phone type="work">905-555-7258</phone>
<phone type="work">905-555-4789</phone>
Figure 6.52 Positional predicate to select the first phone element

Similarly, the XPath in Figure 6.53 selects the third phone element under each customerinfo element. In the sample data, the customer Robert Shoemaker has three phone numbers but Matt Foreman has only two phones. Hence, the result only contains Robert’s third phone number and none of Matt’s phone numbers.

XPath:
/customerinfo/phone[3]
Output:
<phone type="cell">905-555-8743</phone>
Figure 6.53 Positional predicate to select the third phone element

To obtain the last phone element from each document irrespective of the number of phone elements in any given document, use the function last() in the predicate. This function takes no arguments but serves as an index to the last item in a sequence (see Figure 6.54).
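Positional predicates can also be tried with Python's standard xml.etree.ElementTree module, which supports [n] and [last()] after an element name. The sketch below is an illustration under assumptions: the two documents are abridged stand-ins for the sample data (only name and phone elements, with the phone numbers quoted in this chapter), the paths are relative to customerinfo, and `select` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged stand-ins for the two sample documents (assumption: only
# the name and phone elements are reproduced here).
docs = [
    ET.fromstring(
        "<customerinfo><name>Robert Shoemaker</name>"
        "<phone type='work'>905-555-7258</phone>"
        "<phone type='home'>416-555-2937</phone>"
        "<phone type='cell'>905-555-8743</phone></customerinfo>"
    ),
    ET.fromstring(
        "<customerinfo><name>Matt Foreman</name>"
        "<phone type='work'>905-555-4789</phone>"
        "<phone type='home'>416-555-3376</phone></customerinfo>"
    ),
]


def select(path):
    """Apply the path to each document in turn and collect the text values."""
    return [elem.text for doc in docs for elem in doc.findall(path)]


first = select("phone[1]")      # first phone of each document
third = select("phone[3]")      # only documents with at least three phones contribute
last = select("phone[last()]")  # last phone, however many there are

print(first)  # ['905-555-7258', '905-555-4789']
print(third)  # ['905-555-8743']
print(last)   # ['905-555-8743', '416-555-3376']
```

As in XPath, positions are counted from 1, and the position is evaluated per document rather than across the combined result.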
XPath:
/customerinfo/phone[last()]
Output:
<phone type="cell">905-555-8743</phone>
<phone type="home">416-555-3376</phone>
Figure 6.54 Positional predicate to select the last phone element

Related to positional predicates is the function position(). It takes no arguments but returns the position of the context item in the sequence that is being processed. For example, the positional predicate [3] is the same as the predicate [position() = 3].

6.12 UNION AND CONSTRUCTION OF SEQUENCES

Most of the XPath examples so far have returned one type of element, such as phone numbers or names. Sometimes it is desirable to obtain multiple different elements or attributes from each document. This can be achieved with the union operator, which is either written as the union keyword or the pipe character: |. The XPath in Figure 6.55 uses the union operator in the last step of the XPath, to combine the street and city elements into a single sequence. The result contains four elements, street and city from each of the two customers in the sample data. You will later use SQL/XML to return the street and city in two separate columns, which can be a more desirable return format (see Chapter 7, Querying XML Data with SQL/XML).

XPath:
/customerinfo/addr/(street|city)
Output:
<street>845 Kean Street</street>
<city>Aurora</city>
<street>1596 Baseline</street>
<city>Toronto</city>
Figure 6.55 XPath with a union operator

The union of sequences is similar to the construction of sequences. The comma is a sequence constructor and in many cases it produces the same result as a union. For example, the XPath /customerinfo/addr/(street,city) returns the same result as the union in Figure 6.55. However, there are a couple of differences between union and construction of sequences. The comma operator allows you to construct sequences from atomic values. 
The | operator cannot take atomic values as input; it has to take sequences of element or attribute nodes as input. Secondly, the union removes duplicate nodes while the comma operator does not. The de-duplicating of the union is based on node identities, not on node values. This means that two elements are not necessarily considered duplicates just because they have the same element name and value. They are considered duplicates only if they are indeed the same element from the same document.
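The difference between identity-based and value-based duplicates can be modeled in a few lines of Python. This is a simplified sketch, not how DB2 or any XPath engine is implemented: the `Node` class and the `comma` and `union` functions are hypothetical, Python object identity stands in for node identity, and the model keeps first-occurrence order instead of the document order that a real union returns.

```python
class Node:
    """Minimal stand-in for an XML element; identity is Python object identity."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def comma(*sequences):
    """Sequence construction: plain concatenation, duplicates are kept."""
    return [item for seq in sequences for item in seq]


def union(*sequences):
    """Union: drop repeated occurrences of the same node (identity, not value)."""
    seen = set()
    result = []
    for node in comma(*sequences):
        if id(node) not in seen:
            seen.add(id(node))
            result.append(node)
    return result


# Two distinct city elements that happen to have the same name and value.
city_a = Node("city", "Toronto")
city_b = Node("city", "Toronto")

print(len(comma([city_a], [city_a, city_b])))  # 3: nothing is removed
print(len(union([city_a], [city_a, city_b])))  # 2: city_a is kept only once
```

Note that city_b survives the union even though its name and value equal those of city_a, because it is a different node.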
In addition to the union operator there is also an intersect and an except operator. The intersect operator produces the nodes that occur in both sequences, and the except operator returns the nodes that are in the first but not the second sequence.

6.13 XPATH FUNCTIONS

If you look back at Figure 6.1 at the beginning of this chapter, you see that XPath and XQuery do not only share the same data model but also a common set of functions and operators. Throughout this chapter we have used some of these functions such as data(), string(), and not().

XPath and XQuery provide a large number of built-in functions. These include aggregate functions such as count() and sum(), string functions such as contains() and substring(), as well as numeric and other functions. Figure 6.56, Figure 6.57, and Figure 6.58 provide examples of how to use functions in XPath expressions. The count() function returns the number of nodes produced by the expression that is provided as the function argument. Remember that Robert Shoemaker has three phone numbers and Matt Foreman has two. Other functions such as upper-case() and concat() behave in intuitive ways.

XPath:
/customerinfo/count(phone)
Output:
3
2
Figure 6.56 Return the number of phone elements per document

XPath:
/customerinfo/upper-case(name)
Output:
ROBERT SHOEMAKER
MATT FOREMAN
Figure 6.57 Convert the customer names to upper case

XPath:
/customerinfo/concat(name," - ", addr/city)
Output:
Robert Shoemaker - Aurora
Matt Foreman - Toronto
Figure 6.58 Concatenate the customer name and city

Section 8.7, XQuery Functions, contains a more extensive discussion of XPath and XQuery functions. Additionally, Appendix C provides pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
6.14 GENERAL AND VALUE COMPARISONS

All the comparison operators that you have used so far (=, !=, <, <=, >, >=) are called general comparisons because they allow you to compare sequences of zero, one, or multiple items. This is based on existential semantics, as discussed in section 6.8. General comparisons provide a lot of flexibility and serve you well in the vast majority of cases.

There are also value comparison operators, such as eq (equal), lt (less than), le (less than or equal), gt (greater than), ge (greater than or equal), and ne (not equal). Value comparisons are different from general comparisons because they can only compare single items. For example, /customerinfo/addr[city eq "Toronto"] is a valid value comparison as long as there is only one city element per addr. The query /customerinfo[phone eq "408-463-4963"] will fail at runtime because the sample data contains multiple phone elements per customerinfo. The DB2 error message is

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected in the context.

The “( item(), item()+ )” is a regular expression that denotes a sequence of one item followed by one or more items. In total that’s two or more items. So this message is a very formal way of saying that there is a sequence of multiple items (that is, multiple phone elements) when only a single item was allowed. In many cases you can work around this error by writing the XPath expression as /customerinfo/phone[. eq "408-463-4963"] because the dot always refers to exactly one of the phone elements at a time. Another solution is to simply use a general comparison instead: /customerinfo[phone = "408-463-4963"].

Another issue with value comparisons is that they perform string comparisons by default. 
For example, the XPath /customerinfo/addr[pcode-zip lt 95123] will fail with the following message because it tries to use the lt operator with a numeric value (95123), instead of a string value (“95123”).

SQL16003N An expression of data type "xs:integer" cannot be used when the data type "xs:string" is expected in the context. SQLSTATE=10507

You can avoid this error by casting the pcode-zip element to xs:integer, such as [xs:integer(pcode-zip) lt 95123], or by using a general comparison instead.

Value comparisons have one property that general comparisons do not have, and that is transitivity. If x eq y and y eq z then you are safe to conclude that x eq z is also true. This is not possible with the existential semantics of general comparisons for sequences. For example, (1,2,3) = (3,4,5) and (3,4,5) = (5,6,7) are both true, but (1,2,3) = (5,6,7) is false because there is no item in (1,2,3) that is equal to any item in (5,6,7).
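The existential semantics of general comparisons can be checked with a short Python model. It is a sketch under stated assumptions: tuples stand in for XPath sequences, the function name `general` is hypothetical, and typing rules such as casting of untyped values are ignored. A general comparison is modeled as true if at least one pair of items, one from each side, satisfies the operator.

```python
import operator
from itertools import product


def general(op, left, right):
    """Model of an XPath general comparison (existential semantics).

    True if at least one pair (l, r), with l from left and r from right,
    satisfies op. An empty sequence on either side always yields False.
    """
    return any(op(l, r) for l, r in product(left, right))


# Both = and != can be true for the same pair of sequences.
print(general(operator.eq, (1, 2, 3, 4), (7, 8, 2)))  # True (2 equals 2)
print(general(operator.ne, (1, 2, 3, 4), (7, 8, 2)))  # True (1 differs from 7)

# General comparisons are not transitive.
print(general(operator.eq, (1, 2, 3), (3, 4, 5)))  # True
print(general(operator.eq, (3, 4, 5), (5, 6, 7)))  # True
print(general(operator.eq, (1, 2, 3), (5, 6, 7)))  # False

# phone != x versus not(phone = x), using Robert Shoemaker's phone numbers.
phones = ("905-555-7258", "416-555-2937", "905-555-8743")
print(general(operator.ne, phones, ("416-555-2937",)))      # True
print(not general(operator.eq, phones, ("416-555-2937",)))  # False
```

The last two lines correspond to the difference between Figure 6.48 and Figure 6.49 for a customer who has the number among several phone elements.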
In summary, the use of value comparisons opens up various opportunities for errors but in most cases provides little gain. Most applications do not require transitivity and are well-served with general comparisons. One potential benefit of value comparisons is that you can force errors if you want to be alerted when data types or element occurrences are different than what you expect.

6.15 XPATH AXES AND UNABBREVIATED SYNTAX

We have introduced XPath through a series of practical examples. In a more formal introduction you might read about XPath axes. An axis is the direction of movement when navigating through a document. DB2 supports the child axis, the descendant axis, the attribute axis, the self axis, the parent axis, and the descendant-or-self axis. We have used all of these axes in the examples in the previous sections of this chapter. For example, the path /customerinfo/addr/@country uses the child axis to navigate from customerinfo to its child element addr, and the attribute axis to navigate from addr to its attribute country.

All XPath examples in this book use the so-called abbreviated XPath syntax, because it is simple, easy to understand, and recommended. XPath also offers an unabbreviated syntax, which means that the axes are spelled out explicitly in each step of an XPath. This is rarely used. For example:

Abbreviated: /customerinfo/addr/@country
Unabbreviated: /child::customerinfo/child::addr/attribute::country

Abbreviated: /customerinfo//phone
Unabbreviated: /child::customerinfo/descendant-or-self::node()/child::phone

In a nutshell, the unabbreviated XPath syntax is verbose, clumsy, and not used much in practice. We recommend that you do not use it. We have explained it here merely so that you recognize it if it ever crosses your path (no pun intended).

6.16 SUMMARY

XPath is the fundamental language for traversing XML documents, evaluating XML predicates, and retrieving XML values. 
A thorough understanding of XPath is a prerequisite for querying XML data in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Both SQL/XML and XQuery involve XPath. Understanding XPath begins with understanding the XQuery and XPath data model. This data model is inherently different from the relational model. The better you understand the XQuery data model the easier it is for you to write XML queries. Every value in the XQuery and XPath data model is a sequence of zero, one, or multiple items. An item is either an atomic value or a node. Commonly used nodes include document nodes, element nodes, attribute nodes, and text nodes. Element nodes can include child nodes to form hierarchies
of nodes, such as XML documents. Hence, a sequence of zero, one, or multiple XML documents is a value in the XQuery and XPath data model. A sequence of individual elements, a sequence of integer numbers, and so on are also values in the data model. Every XQuery or XPath query takes a value of this data model as input and produces another value of the data model as output. Most commonly an XPath expression consists of one or multiple steps, separated by a slash (/), where each step is an element name or wildcard. This allows you to navigate into an XML document tree to select specific elements. If you want to select attribute nodes then the last step in a path must be an attribute name that’s preceded by the @ sign. Since an XML document can contain elements that occur multiple times, a single XPath expression may select multiple nodes. At each step an XPath can contain a predicate to restrict the search in the document. XPath predicates must be enclosed in square brackets. The evaluation of XPath expressions and predicates is always based on existential semantics. Roughly speaking, existential semantics means that the existence of at least one matching item is sufficient for a predicate to evaluate to true. This is of particular importance when you query XML documents with repeating elements. Repeating XML elements and existential semantics are some of the most profound differences between the XML world and relational world. In the following chapters you learn how to use XPath in SQL/XML and XQuery.
Chapter 7 Querying XML Data with SQL/XML

The SQL language standard includes a variety of functions and features to process XML data. This functionality is commonly referred to as SQL/XML. The SQL/XML functions that allow you to embed XPath and XQuery expressions in SQL are of particular interest. These functions enable you to use familiar SQL statements enriched with XPath expressions to query XML data in a DB2 database. They also facilitate the simultaneous processing of XML and relational data in the same query. This marriage of two worlds, XML and relational, is extremely powerful and versatile.

Although SQL/XML allows the integration of SQL and XQuery, this chapter focuses on the integration of SQL and XPath, which is supported in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

The discussion of SQL/XML in this chapter assumes that you have a good understanding of XPath (see Chapter 6, Querying XML Data: Introduction and XPath). The examples in this chapter also use the same two sample documents that were used throughout Chapter 6. Please refer to Figure 6.7 in section 6.3, Sample Data for XPath, SQL/XML, and XQuery. All examples are based on the following customer table:

CREATE TABLE customer(id INTEGER, info XML)

We assume that this table contains two rows with values 1003 and 1004 in the id column, and the two documents from Figure 6.7 in the XML column info.

The remainder of this chapter is structured as follows:
• An overview of SQL/XML is given in section 7.1.
• The core SQL/XML functionality for extracting selected information from XML documents and defining XML predicates is covered in sections 7.2, 7.3, and 7.4.
• Common mistakes with SQL/XML predicates are highlighted in section 7.5.
• Parameter markers, dynamically computed XPath, sorting of XML data, and handling of binary data are discussed in sections 7.6 through 7.9.

7.1 OVERVIEW OF SQL/XML

The term SQL/XML refers to the XML-specific features and functions in the SQL:2003 and SQL:2006 standards. SQL/XML defines the following:
• The XML data type, which is a regular SQL type just like INTEGER or CHAR for example. SQL/XML defines the semantics of this type, not its storage format.
• Functions that convert XML type values to and from non-XML data types, such as CHAR, VARCHAR, CLOB, and others. These functions are XMLSERIALIZE, XMLPARSE, and XMLCAST.
• The function XMLVALIDATE for XML Schema validation and the predicate IS VALIDATED, which checks the validation status of an XML document or fragment.
• XML publishing functions, also sometimes called constructor functions, such as XMLELEMENT, XMLATTRIBUTES, and XMLAGG, which allow you to construct new XML documents or fragments. The input data for such XML construction can come from relational columns, from XML columns, or both. This topic is covered in Chapter 10, Producing XML from Relational Data.
• Functions to embed XPath and XQuery in SQL statements. These functions are XMLQUERY, XMLTABLE, and the XMLEXISTS predicate.

All of these SQL/XML functions are supported in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. In this chapter we focus on the following:
• XMLQUERY: A scalar function that is typically used in the SELECT clause of an SQL query to extract XML fragments or values from an XML document.
• XMLTABLE: A table function that is used in the FROM clause of an SQL statement. It reads one or multiple values from an XML document and returns them as a set of rows.
• XMLEXISTS: A predicate that is commonly used in the WHERE clause of an SQL statement to express predicates over XML data.
• XMLCAST: A function that converts individual XML values to SQL data types.

Now, let’s turn to examples to see how these functions work.
7.2 RETRIEVING XML DOCUMENTS OR DOCUMENT FRAGMENTS WITH XMLQUERY

The simplest way of retrieving XML data with SQL is to include an XML column name in the SELECT list of an SQL query. For example, the SQL statement in Figure 7.1 returns a single column of type XML (info) and two rows, one row for each of our two sample documents in the customer table. Below the SQL statement in Figure 7.1 you see a corresponding XQuery that returns the same result.

--SQL:
SELECT info FROM customer;
--XQuery:
xquery db2-fn:xmlcolumn('CUSTOMER.INFO');
Figure 7.1 Retrieve all documents from the table

You can extend the SQL query in Figure 7.1 with other features of the SQL language, such as a WHERE clause to select only specific rows (documents) from the table. This is shown in Figure 7.2, together with an equivalent XQuery for comparison.

--SQL:
SELECT info FROM customer WHERE id = 1003;
--XQuery:
xquery db2-fn:sqlquery('SELECT info FROM customer WHERE id = 1003');
Figure 7.2 Retrieve selected documents from the table

In many situations it is desirable not to retrieve full documents from the database, but just specific XML elements, attributes, or fragments that are of interest. For example, if you only need to retrieve the customer names, you can use the XMLQUERY function in the SELECT clause to extract just that element (see Figure 7.3). The argument of the XMLQUERY function can be any XQuery or XPath expression. This expression needs to know which column to operate on, because a table could have multiple XML columns. The solution is to prefix the XPath with $INFO, a reference to the XML column in our sample table. This reference has to be in uppercase and must start with the $ sign (see section 7.2.1 for details).

The SQL/XML statement in Figure 7.3 uses SQL as the top-level language and has an embedded XPath expression. 
Below it you see a corresponding XQuery that executes the same XPath expression without the use of any SQL. The query result and performance are the same. In particular, note that the return type of the XMLQUERY function is always XML. We will later discuss cases where SQL/XML can have advantages over XQuery and vice versa.
--SQL/XML:
SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer
--XQuery:
xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/name;
--Output:
<name>Robert Shoemaker</name>
<name>Matt Foreman</name>
2 record(s) selected.
Figure 7.3 Extracting one element from each document

The XMLQUERY function in Figure 7.3 is a scalar function, which means that it takes one value as input and produces one value as output. The XMLQUERY function is applied to one row at a time and so its input value is always the XML document of the current row. The XMLQUERY function does not normally process XML documents from multiple rows at the same time. Its output value is the result of the XPath expression applied to the current document. This result is always a sequence of zero, one, or more items. Such a sequence represents a single value (instance) of the XQuery Data Model.

7.2.1 Referencing XML Columns in SQL/XML Functions

Figure 7.3 shows only one of three ways in which the XML column can be referenced inside the XMLQUERY function. Here are all three ways in more detail:
• Direct reference of the XML column name as $INFO. This $INFO is an XQuery variable that is implicitly bound to an XML column of the same name. This is only supported in DB2 for Linux, UNIX, and Windows version 9.5 and higher. It only works if the XML column name is unique across all tables that are referenced in the FROM clause. For brevity we will use this notation in most of the examples in this chapter.

SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;

• Explicit assignment of the XML column name to an alias of your choice, which is then used as the context at the beginning of the XPath expression. This assignment is done in the passing clause of the XMLQUERY function. It also allows you to qualify the column name with its table name (passing customer.info AS "i") to avoid ambiguity. 
The variable name $i has to be unique within each SQL/XML function, not across all functions. You will later see that this passing clause also allows you to pass parameter markers or expressions into the embedded XQuery. This is supported since version 9 of DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
    SELECT XMLQUERY('$i/customerinfo/name' passing info as "i")
    FROM customer;

    -- query with two tables, both have an XML column "info":
    SELECT XMLQUERY('$i/customerinfo/name' passing c1.info as "i"),
           XMLQUERY('$i/customerinfo/name' passing c2.info as "i")
    FROM customer c1, customer2 c2;

• No XQuery variable at the beginning of the XPath expression. Instead, the XML column name is identified in the passing clause without assignment to a variable. This is only supported in DB2 for z/OS.

    SELECT XMLQUERY('/customerinfo/name' passing info)
    FROM customer;

7.2.2 Retrieving Element Values Without XML Tags

There are several ways in which you can return the customer names without the element tags <name></name> around them. One option is to use /text() in the XPath expression to only return the text node of the name element, as in Figure 7.4 (a). The column in the query result set is still of type XML. Alternatively, you can wrap the function XMLCAST() around the XMLQUERY function to convert the XML result to a non-XML type, as in Figure 7.4 (b). XMLCAST() automatically removes the tags from the returned elements. The output is the same as from Figure 7.4 (a), except that the return type is VARCHAR(25) instead of XML.

    --(a) SQL/XML:
    SELECT XMLQUERY('$INFO/customerinfo/name/text()')
    FROM customer;

    --(b) SQL/XML:
    SELECT XMLCAST( XMLQUERY('$INFO/customerinfo/name') AS VARCHAR(25))
    FROM customer;

    --Output:
    Robert Shoemaker
    Matt Foreman

      2 record(s) selected.

Figure 7.4 Returning element values without tags

A common requirement is to retrieve multiple values from a document, such as the customers’ street and city, and to return them in separate columns of the same result row. Separate columns can be produced by using multiple XMLQUERY functions in the SELECT clause (see Figure 7.5).
The same can be achieved with the XMLTABLE function, which is discussed later. Figure 7.5 also shows that you can return a mix of relational columns and XML values.

    SELECT id,
           XMLQUERY('$INFO/customerinfo/addr/street/text()'),
           XMLQUERY('$INFO/customerinfo/addr/city/text()')
    FROM customer;

    1003  845 Kean Street  Aurora
    1004  1596 Baseline    Toronto

      2 record(s) selected.

Figure 7.5 Returning multiple element values in separate columns

7.2.3 Retrieving Repeating Elements with XMLQUERY

The SQL/XML query in Figure 7.6 uses the path expression /customerinfo/phone, which you know returns multiple elements from each of the two input documents. This SELECT statement produces one result row for each of the two input rows. Each result row contains the sequence of phone numbers from the corresponding input document. Each of these two sequences is returned as a string, which the consuming application then needs to break down. However, such a sequence of two or more phone elements is not a well-formed XML document, because a single common root element is missing. Hence, if your application uses an XML parser to process this non-well-formed query result, it will fail with an error.

    SELECT id, XMLQUERY('$INFO/customerinfo/phone')
    FROM customer;

    1003  <phone type="work">905-555-7258</phone><phone type="home">416-555-2937</phone><phone type="cell">905-555-8743</phone>
    1004  <phone type="work">905-555-4789</phone><phone type="home">416-555-3376</phone>

      2 record(s) selected.

Figure 7.6 Returning a sequence of elements from each document

Figure 7.7 shows the same query with /text(), and you see that the result values in each sequence are simply concatenated.
    SELECT id, XMLQUERY('$INFO/customerinfo/phone/text()')
    FROM customer;

    1003  905-555-7258416-555-2937905-555-8743
    1004  905-555-4789416-555-3376

      2 record(s) selected.

Figure 7.7 Returning a sequence of text nodes from each document

The conclusion is that the XMLQUERY function is typically not very useful to return repeating elements. As a solution, use the XMLTABLE function, which is explained in the next section.

7.3 RETRIEVING XML VALUES IN RELATIONAL FORMAT WITH XMLTABLE

The XMLTABLE function is very versatile and one of the most powerful SQL/XML functions. Let’s start with some simple examples of the XMLTABLE function and then get back to returning the repeating phone elements in a more suitable format.

7.3.1 Generating Rows and Columns from XML Data

The query in Figure 7.8 uses the XMLTABLE function in the FROM clause. The XMLTABLE function references the info column and is therefore implicitly joined with the table customer.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custID    INTEGER      PATH '@Cid',
             custname  VARCHAR(20)  PATH 'name',
             street    VARCHAR(20)  PATH 'addr/street',
             city      VARCHAR(16)  PATH 'addr/city') AS T;

    CUSTID CUSTNAME             STREET               CITY
    ------ -------------------- -------------------- ------------
    1003   Robert Shoemaker     845 Kean Street      Aurora
    1004   Matt Foreman         1596 Baseline        Toronto

      2 record(s) selected.

Figure 7.8 Using XMLTABLE to return XML values in relational columns

In DB2 for z/OS the XMLTABLE function must contain a PASSING clause to define the reference to the XML column, like this:

    XMLTABLE('$i/customerinfo' PASSING info AS "i"
The XMLTABLE function contains one row-generating XQuery expression and, in the COLUMNS clause, multiple column-generating expressions. The row-generating expression is the XPath $INFO/customerinfo. It is applied to each XML document in the XML column and can produce zero, one, or multiple rows per document. In this example, the row-generating expression produces one customerinfo element (fragment) per document. The output of the XMLTABLE function contains one row for each of these customerinfo elements. The number of elements produced by the row-generating XQuery expression determines the number of rows produced by the XMLTABLE function.

The COLUMNS clause transforms XML data into relational format. Each of the entries in this clause defines a column with a column name and an SQL data type. In Figure 7.8, the returned rows have four columns named custID, custname, street, and city. The values for each column are extracted from the customerinfo fragments that are produced by the row-generating expression, and then cast to the SQL data types. For example, the path addr/city is applied to each customerinfo element to obtain the value for the column city.

The row-generating expression provides the context for the column-generating expressions. This means that the column-generating expressions are not absolute paths, but relative to the row-generating expression. You can typically append the column-generating expressions to the row-generating expression to get an intuitive idea of what a given XMLTABLE function returns in its columns.

The result set of the XMLTABLE query can be treated like any SQL table. You can query and manipulate it much like you use regular row sets or views. The column definitions in the COLUMNS clause can use any SQL data type, such as INTEGER, DECIMAL, CHAR, DATE, and so on. If an extracted XML value cannot be cast to the assigned SQL type, the query fails with an error message.
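The interplay between the row-generating expression and the column-generating expressions can be mimicked outside the database. The following Python sketch is only an illustration of the semantics, not of how DB2 evaluates XMLTABLE. The helper names xmltable_sketch and column_value are hypothetical, the two documents are abbreviated copies of the sample data, and Python's xml.etree.ElementTree supports only a small XPath subset, so attribute steps such as @Cid are handled by hand.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two sample documents (illustration only).
DOCS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr><street>845 Kean Street</street><city>Aurora</city></addr>'
    '</customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr><street>1596 Baseline</street><city>Toronto</city></addr>'
    '</customerinfo>',
]


def column_value(context, path, cast):
    """Evaluate one column path relative to the context node, then cast it."""
    if path.startswith('@'):
        # Attribute step such as '@Cid'.
        raw = context.get(path[1:])
    else:
        # Child-element path such as 'addr/city'.
        node = context.find(path)
        raw = None if node is None else node.text
    # A value that is not present is returned as None.
    return None if raw is None else cast(raw)


def xmltable_sketch(docs, row_path, columns):
    """Hypothetical helper: one output row per node matched by row_path."""
    rows = []
    for doc in docs:
        # Wrap the document so that row_path can start at its root element.
        wrapper = ET.Element('doc')
        wrapper.append(ET.fromstring(doc))
        for context in wrapper.findall(row_path):
            rows.append(tuple(column_value(context, path, cast)
                              for _name, cast, path in columns))
    return rows


# Column name, cast function (standing in for the SQL type), relative path.
COLUMNS = [
    ('custID', int, '@Cid'),
    ('custname', str, 'name'),
    ('street', str, 'addr/street'),
    ('city', str, 'addr/city'),
]

rows = xmltable_sketch(DOCS, 'customerinfo', COLUMNS)
for row in rows:
    print(row)
```

Each matched customerinfo element becomes the context for all column paths, which is why the column paths are written relative to it. A Cid value that int() cannot convert would raise an exception here, loosely analogous to the query failing when a cast to the SQL type is not possible.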
DB2 for Linux, UNIX, and Windows also allows you to use the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions in the row-generating expression of the XMLTABLE function (see Figure 7.9). In this case you omit the table name customer from the FROM clause. The query result is the same as in Figure 7.8. (This is not available in DB2 for z/OS.)

    SELECT T.*
    FROM XMLTABLE('db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo'
           COLUMNS
             custID    INTEGER      PATH '@Cid',
             custname  VARCHAR(20)  PATH 'name',
             street    VARCHAR(20)  PATH 'addr/street',
             city      VARCHAR(16)  PATH 'addr/city') AS T;

Figure 7.9 Alternative syntax in DB2 for Linux, UNIX, and Windows
7.3.2 Dealing with Missing Elements

XML data can contain optional elements that are not present in all documents. For example, in our sample data you can see that Robert Shoemaker does not have an assistant element. What happens if the optional element assistant is referenced in the row-generating expression or in a column-generating expression? Let’s look at these two cases separately.

In Figure 7.10 the optional assistant element is referenced in the row-generating expression of the XMLTABLE function. The query seeks to return the name and phone number of all assistants in our customer data. Since the XMLTABLE function returns exactly one row for each node that is produced by the row-generating expression, it does not return any rows for the documents that do not contain an assistant element. Therefore, the query in Figure 7.10 returns the name and phone number of Matt Foreman’s assistant, but no information from Robert Shoemaker’s XML document where no assistant element is present. We will revisit this situation at the end of section 7.3 in a more complex scenario.

    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo/assistant' PASSING info AS "i"
           COLUMNS
             a_name   VARCHAR(20)  PATH 'name',
             a_phone  VARCHAR(20)  PATH 'phone') AS T;

    A_NAME               A_PHONE
    -------------------- --------------------
    Gopher Runner        416-555-3426

      1 record(s) selected.

Figure 7.10 Optional element in the row-generating expression

In Figure 7.11 the optional assistant element is referenced in a column-generating expression of the XMLTABLE function. This query intends to return the customer name and the assistant name from each document. For each document where the assistant element does not exist, the column expression assistant/name produces an empty sequence, which is automatically converted to a NULL value.
    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo' PASSING info AS "i"
           COLUMNS
             c_name  VARCHAR(20)  PATH 'name',
             a_name  VARCHAR(20)  PATH 'assistant/name') AS T;

    C_NAME               A_NAME
    -------------------- --------------------
    Robert Shoemaker     NULL
    Matt Foreman         Gopher Runner

      2 record(s) selected.

Figure 7.11 Optional element in a column-generating expression

If you prefer to generate a default value for missing elements instead of NULL values, use the default clause to define a default value other than NULL. This is done in Figure 7.12.

    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo' PASSING info AS "i"
           COLUMNS
             c_name  VARCHAR(20)  PATH 'name',
             a_name  VARCHAR(20)  default 'none' PATH 'assistant/name') AS T;

    C_NAME               A_NAME
    -------------------- --------------------
    Robert Shoemaker     none
    Matt Foreman         Gopher Runner

      2 record(s) selected.

Figure 7.12 Defining a default value for missing elements

7.3.3 Avoiding Type Errors

Be aware that every expression in the COLUMNS clause must return a value that can be cast to the specified data type. Otherwise the XMLTABLE execution fails. Consider the following cases:

• Incompatible data types. For example, the query in Figure 7.8 fails when it encounters an XML document where the Cid attribute has a non-numeric value, which cannot be cast to INTEGER.

• String length. If the XMLTABLE function defines a column of type CHAR(n) or VARCHAR(n), and the column-generating expression produces a string value that’s longer than n, then one of two things happens:
    • The value is truncated to n bytes, without warning or error. This truncation is mandated by the latest SQL/XML standard and implemented in DB2 for z/OS.

    • The query fails with error SQL16061N. This behavior was allowed by a previous version of the SQL/XML standard and is still effective in DB2 for Linux, UNIX, and Windows.

The following examples show how such cases can be handled. In Figure 7.13, the definition of the custID column uses the XQuery if-then-else and castable expressions to check whether the Cid attribute can indeed be cast to INTEGER, and returns -1 if not. The value for the column custname is produced by the substring function so that only the first 20 characters of the actual name are used. The column-generating expression for the city uses if-then-else and the string-length function to test the length of the city value and returns an error flag if it is too long. Such techniques can be useful if strict data types are not enforced with XML Schema validation.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custID    INTEGER      PATH '(if (@Cid castable as xs:integer)
                                           then @Cid else -1)',
             custname  VARCHAR(20)  PATH 'name/substring(.,1,20)',
             street    VARCHAR(20)  PATH 'addr/street',
             city      VARCHAR(16)  PATH 'addr/city/(if (string-length(.) <= 16)
                                           then . else "Error!")') AS T;

Figure 7.13 Safeguarding against type errors in XMLTABLE

7.3.4 Retrieving Repeating Elements with XMLTABLE

Another error condition arises when a path expression in the COLUMNS clause returns a sequence of two or more items. In this situation the XMLTABLE execution fails, because it is not possible to convert a sequence of multiple XML values into a single atomic SQL value. As an example, consider the phone element, which occurs multiple times per document. The following example produces a list of customer names with their phone numbers. The first attempt is shown in Figure 7.14.
This query fails with error message SQL16003N. This message means that the query is trying to cast an XML sequence of multiple items to a single VARCHAR value, which is not possible. The reason for this error is that phone returns multiple values per customerinfo element.
    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)  PATH 'name',
             phone     VARCHAR(15)  PATH 'phone') AS T;

    SQL16003N An expression of data type "( item(), item()+ )" cannot be
    used when the data type "VARCHAR_15" is expected in the context.
    Error QName=err:XPTY0004. SQLSTATE=10507

Figure 7.14 Cannot map a sequence of multiple items to an SQL data type!

There are at least five ways to avoid this error:

• Return only one of multiple phone numbers (see Figure 7.15 and Figure 7.16)
• Return a list of multiple phone numbers in a single VARCHAR value (see Figure 7.17)
• Return a list of multiple phone numbers as an XML type (see Figure 7.18)
• Return multiple phone columns (see Figure 7.19)
• Return one row per phone number (see Figure 7.20)

To return only one of the phone numbers you can add a positional predicate [1] to the column-generating path expression, so that only the first phone element is returned (see Figure 7.15).

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)  PATH 'name',
             phone     VARCHAR(12)  PATH 'phone[1]') AS T;

    Robert Shoemaker     905-555-7258
    Matt Foreman         905-555-4789

      2 record(s) selected.

Figure 7.15 Return only the first of multiple phone numbers

Alternatively, you could add a predicate on the type attribute of the phone element to only return phones of a certain kind. The query in Figure 7.16 produces cell phone numbers only. Since Matt Foreman doesn’t have a cell phone, a NULL value is returned instead. If there were a customer who has multiple cell phones (that is, multiple phone elements where the type attribute has the value “cell”) the query would still fail with error SQL16003N.
    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)  PATH 'name',
             phone     VARCHAR(12)  PATH 'phone[@type="cell"]') AS T;

    Robert Shoemaker     905-555-8743
    Matt Foreman         NULL

      2 record(s) selected.

Figure 7.16 Return only one type of phone number

If you need to return all phone numbers, you can list them in a single column value. However, VARCHAR(12) is too small for multiple phone numbers. Use VARCHAR(100) here, which can hold multiple phone numbers separated by a comma, as shown in Figure 7.17. The function string-join requires two parameters: a sequence of string values and a separator string. In this example, the first parameter is the sequence of the phone element text nodes, and the second parameter is the comma character “,”.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)   PATH 'name',
             phone     VARCHAR(100)  PATH 'string-join(phone/text(),",")') AS T;

    Robert Shoemaker     905-555-7258,416-555-2937,905-555-8743
    Matt Foreman         905-555-4789,416-555-3376

      2 record(s) selected.

Figure 7.17 Return a list of multiple phone numbers in a single VARCHAR value

Yet another option for dealing with multiple phone numbers is to return an XML sequence of phone elements. To achieve this, the generated phone column needs to be of type XML. This allows you to return any XML value as the result of the XPath expression. This value can be an atomic value or a sequence of zero or more items. The query in Figure 7.18 returns one row per customer with their phone elements as an XML sequence in the XML column phone.

Such a sequence of multiple elements is not a well-formed document, because a single common root element is missing. If you need to produce well-formed XML documents, you can wrap the sequence of phone elements in a new root element. For example, you could change the path expression in the COLUMNS clause from 'phone' to '<phones>{phone}</phones>'.
This notation is called direct element construction and is explained in detail in section 8.4, Constructing XML Data.
    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)  PATH 'name',
             phone     XML          PATH 'phone') AS T;

    CUSTNAME               PHONE
    ---------------------- ---------------------------------------
    Robert Shoemaker       <phone type="work">905-555-7258</phone>
                           <phone type="home">416-555-2937</phone>
                           <phone type="cell">905-555-8743</phone>
    Matt Foreman           <phone type="work">905-555-4789</phone>
                           <phone type="home">416-555-3376</phone>

      2 record(s) selected.

Figure 7.18 Return a list of multiple phone numbers as an XML type

The XMLTABLE function also allows you to return each phone number as a separate VARCHAR value, by producing a fixed number of phone columns. The query in Figure 7.19 generates the column custname for the customer name, plus three columns for phone numbers: phone1, phone2, and phone3. Positional predicates are used to map the first phone element in a document to the column phone1, the second phone element to the column phone2, and the third phone element to phone3.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname  VARCHAR(20)  PATH 'name',
             phone1    VARCHAR(12)  PATH 'phone[1]',
             phone2    VARCHAR(12)  PATH 'phone[2]',
             phone3    VARCHAR(12)  PATH 'phone[3]') as T;

    CUSTNAME             PHONE1       PHONE2       PHONE3
    -------------------- ------------ ------------ ------------
    Robert Shoemaker     905-555-7258 416-555-2937 905-555-8743
    Matt Foreman         905-555-4789 416-555-3376 NULL

      2 record(s) selected.

Figure 7.19 Return multiple phone columns

An obvious drawback to this approach is that a variable number of items is mapped to a fixed number of columns. This is a conceptual mismatch. A customer might have more phone numbers than anticipated. Others might have fewer, which results in NULL values. But, depending on the
requirements of your application, mapping different occurrences of an element to different columns in the result set can be a very useful query writing technique.

The fifth option for dealing with multiple phone elements per customer is to return them in separate rows. In this case, you need to produce one row per phone number instead of one row per customer. For that purpose, the XMLTABLE function in Figure 7.20 uses a different row-generating XPath expression: /customerinfo/phone. The number of elements identified by this row-generating expression determines the number of rows produced by the XMLTABLE function. Since there is a one-to-many relationship between customers and phones, the customer names get repeated for each of their phones.

Remember that the row-generating expression provides the context for the column-generating path expressions. As the context now consists of phone elements and not customerinfo elements, the XPath expressions in the COLUMNS clause have changed accordingly. The path for the custname column begins with a parent step (two dots) because name is a sibling of phone. The path for the phone column is simply a dot, which denotes the current context and that is always the current phone element.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo/phone'
           COLUMNS
             custname  VARCHAR(20)  PATH '../name',
             phone     VARCHAR(15)  PATH '.',
             type      VARCHAR(10)  PATH '@type') AS T;

    CUSTNAME             PHONE           TYPE
    -------------------- --------------- ----------
    Robert Shoemaker     905-555-7258    work
    Robert Shoemaker     416-555-2937    home
    Robert Shoemaker     905-555-8743    cell
    Matt Foreman         905-555-4789    work
    Matt Foreman         416-555-3376    home

      5 record(s) selected.

Figure 7.20 Return one row per phone element

7.3.5 Numbering XMLTABLE Rows Based on Repeating Elements

The query in Figure 7.21 is the same as in Figure 7.20 except that the column seqno has been added to the XMLTABLE function.
The definition of this column does not consist of a data type and a path, but just of the keywords FOR ORDINALITY. This produces a column of type BIGINT that contains consecutive numbers for the rows produced by the row-generating expression of the XMLTABLE function. This numbering automatically restarts at 1 for each input document,
because the rows generated from each document are numbered separately. These ordinality numbers reflect the order in which the values appeared in the corresponding input document.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo/phone'
           COLUMNS
             seqno     FOR ORDINALITY,
             custname  VARCHAR(20)  PATH '../name',
             phone     VARCHAR(15)  PATH '.',
             type      VARCHAR(10)  PATH '@type') AS T;

    SEQNO CUSTNAME             PHONE           TYPE
    ----- -------------------- --------------- ----------
    1     Robert Shoemaker     905-555-7258    work
    2     Robert Shoemaker     416-555-2937    home
    3     Robert Shoemaker     905-555-8743    cell
    1     Matt Foreman         905-555-4789    work
    2     Matt Foreman         416-555-3376    home

      5 record(s) selected.

Figure 7.21 Add a sequence number for each generated row

7.3.6 Retrieving Multiple Repeating Elements at Different Levels

To allow for another interesting example, let’s assume that the customer Matt Foreman has multiple assistants, and each assistant can have multiple phone numbers, as shown in Figure 7.22.

    <customerinfo Cid="1004">
      <name>Matt Foreman</name>
      <addr country="Canada">
        <street>1596 Baseline</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M3Z 5H9</pcode-zip>
      </addr>
      <assistant>
        <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>
        <phone type="cell">416-911-1234</phone>
      </assistant>
      <assistant>
        <name>Peter Browse</name>
        <phone type="work">905-841-0701</phone>
        <phone type="home">416-696-2620</phone>
      </assistant>
    </customerinfo>

Figure 7.22 Sample document
In the XML structure shown in Figure 7.22 there is a one-to-many relationship between customers and assistants, and between assistants and phones. How can you produce a list that includes customer names, assistant names, and assistant phone numbers? The trick is to use a row-generating expression that navigates to the deepest repeating element, which is /customerinfo/assistant/phone. This is shown in Figure 7.23. Based on the assistant phone element, two consecutive parent steps are required to reach the customer’s name element, and just one parent step to obtain the assistant name.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo/assistant/phone'
           COLUMNS
             custname   VARCHAR(20)  PATH '../../name',
             assistant  VARCHAR(20)  PATH '../name',
             a_phone    VARCHAR(12)  PATH '.') AS T;

    CUSTNAME             ASSISTANT       A_PHONE
    -------------------- --------------- ------------
    Matt Foreman         Gopher Runner   416-555-3426
    Matt Foreman         Gopher Runner   416-911-1234
    Matt Foreman         Peter Browse    905-841-0701
    Matt Foreman         Peter Browse    416-696-2620

      4 record(s) selected.

Figure 7.23 Navigate to the deepest repeating element first

Note that the result set in Figure 7.23 does not include any row for Robert Shoemaker. This is because the XML document for Robert Shoemaker does not contain an assistant element. Hence, the row-generating expression in this query never produces any rows for that input document. This is fine if the intention was to only list customers who have assistants. But, if you need to list all customers even if they don’t have assistants, then the query in Figure 7.23 produces an incomplete result.

Figure 7.24 shows one possible way in which you can include Robert Shoemaker in the result set. The key idea is to extend the row-generating expression so that it produces a row for a customer even if the assistant element does not exist.
The new row-generating expression is

    /customerinfo/(assistant/phone, .[not(assistant)]/name/text() )

It uses a sequence constructor, which was discussed in section 6.12, Union and Construction of Sequences. The first expression in the sequence constructor, assistant/phone, produces assistant phone elements if they exist. The second expression, .[not(assistant)]/name/text(), produces the text node of the customer name if the assistant element does not exist.
The two expressions are mutually exclusive in the sense that for any given document only one of them produces any items while the other produces an empty sequence. The existence of the assistant element determines whether the first or the second expression produces any nodes. The same behavior can be achieved with an if-then-else expression, which you may find more intuitive:

    /customerinfo/(if (assistant) then assistant/phone else name/text() )

Since the XPath expressions in the COLUMNS clause are applied to the nodes produced by the row-generating expression, they now need to work for both assistant phone elements as well as customer name text nodes. We purposefully navigate to the text nodes of the name elements, because they are on the same level of the document tree as the assistant phone elements. With this trick, the paths in the column definition work for both. For example, the path ../../name always produces the customer name element, no matter whether the context node is an assistant phone element or customer name text node.

Note that the predicate [../phone] was added to the column expression for a_phone. This predicate ensures that the column a_phone is populated only when the current node produced by the row-generating expression is a phone element, and not a customer name text node.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo/(assistant/phone ,
                   .[not(assistant)]/name/text() )'
           COLUMNS
             custname   VARCHAR(20)  PATH '../../name',
             assistant  VARCHAR(20)  PATH '../name',
             a_phone    VARCHAR(12)  PATH '.[../phone]') AS T;

    CUSTNAME             ASSISTANT       A_PHONE
    -------------------- --------------- ------------
    Robert Shoemaker     NULL            NULL
    Matt Foreman         Gopher Runner   416-555-3426
    Matt Foreman         Gopher Runner   416-911-1234
    Matt Foreman         Peter Browse    905-841-0701
    Matt Foreman         Peter Browse    416-696-2620

      5 record(s) selected.

Figure 7.24 Producing rows for missing elements
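The if-then-else form of the row-generating expression may be easier to follow when it is written as ordinary procedural logic. The following Python sketch is only an illustration of that logic and does not use XPath parent steps, because xml.etree.ElementTree keeps no parent pointers. The function name assistant_phone_rows is hypothetical, and the documents are abbreviated from the sample data and Figure 7.22.

```python
import xml.etree.ElementTree as ET

# Abbreviated documents: one customer without and one with assistants.
DOCS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<assistant><name>Gopher Runner</name>'
    '<phone type="home">416-555-3426</phone>'
    '<phone type="cell">416-911-1234</phone></assistant>'
    '<assistant><name>Peter Browse</name>'
    '<phone type="work">905-841-0701</phone>'
    '<phone type="home">416-696-2620</phone></assistant>'
    '</customerinfo>',
]


def assistant_phone_rows(docs):
    """Hypothetical helper: (custname, assistant, a_phone) tuples."""
    rows = []
    for doc in docs:
        customer = ET.fromstring(doc)
        custname = customer.findtext('name')
        assistants = customer.findall('assistant')
        if assistants:
            # Corresponds to the branch "assistant/phone":
            # one row per phone of each assistant.
            for assistant in assistants:
                for phone in assistant.findall('phone'):
                    rows.append((custname,
                                 assistant.findtext('name'),
                                 phone.text))
        else:
            # Corresponds to the branch "name/text()": keep the customer
            # and leave the assistant columns empty (None stands for NULL).
            rows.append((custname, None, None))
    return rows


rows = assistant_phone_rows(DOCS)
for row in rows:
    print(row)
```

The five tuples correspond to the five rows of Figure 7.24. As with the XPath version, an assistant element that contains no phone elements would contribute no rows in this sketch.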
7.4 USING XPATH PREDICATES IN SQL/XML WITH XMLEXISTS

Most of the SQL/XML queries in the previous sections don’t have a WHERE clause. Therefore they produce results from all documents (rows) in the customer table. One exception was Figure 7.2 in section 7.2, which uses a predicate on the relational id column as a document filter (WHERE id = 1003). Your tables might have additional relational columns and you can use all traditional SQL capabilities to formulate WHERE clauses with relational predicates or subqueries to select rows (documents) from your table. However, you will also want to use XPath predicates to filter query results based on values in the XML data, as discussed in section 6.7, XPath Predicates.

The most typical way of using an XPath predicate in an SQL/XML statement is to include it in the XMLEXISTS predicate. The XMLEXISTS predicate evaluates the embedded XPath or XQuery expression one document (row) at a time. If the XPath returns a non-empty result, that is, a sequence of one or more items, then XMLEXISTS returns TRUE and the corresponding row is included in the result set. If the XPath does not return any items, that is, it returns an empty sequence, then XMLEXISTS returns FALSE and the corresponding row is eliminated from the result set.

Let’s look at Figure 7.25 to see how this works. Much like the XMLQUERY function, the XMLEXISTS predicate references the info column of the customer table to evaluate the XPath expression. For each row, the question is whether this XPath returns an empty or non-empty result. In general, the XPath /customerinfo[addr/city = "Aurora"] returns customerinfo elements if they have an addr element with a city that has the value Aurora. Otherwise it returns the empty sequence. The first row in the customer table contains a document that has such a customerinfo element where addr/city equals Aurora.
This means that the result of the XPath is non-empty, so XMLEXISTS returns TRUE and the row is qualified for the result set. The XMLQUERY function in the SELECT clause then extracts the name element and produces a result row. The second row in our customer table contains a document where the city element does not have the value Aurora. So the document has no customerinfo element that fulfills the predicate in square brackets. The XPath inside the XMLEXISTS predicate therefore returns the empty sequence. This causes the XMLEXISTS predicate to return FALSE and the row is eliminated. Consequently, the XMLQUERY function is not applied to this second document and no further result rows are produced.
    SELECT XMLQUERY('$INFO/customerinfo/name')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Aurora"]');

    <name>Robert Shoemaker</name>

      1 record(s) selected.

Figure 7.25 SQL/XML query with XMLEXISTS predicate in the WHERE clause

For comparison, the query in Figure 7.26 returns the same result as the one in Figure 7.25. The difference is that the query in Figure 7.26 is a single XPath expression with no SQL involved. It contains the filtering predicate in square brackets and defines the return value with the last step of the path: /name. In Figure 7.25, the same processing is split across two XPaths in the SQL/XML statement. The predicate is expressed with XMLEXISTS in the WHERE clause and its sole purpose is the elimination of non-matching rows. The extraction (projection) of the name element happens separately in the XMLQUERY function in the SELECT clause. This is a common and useful pattern of an SQL/XML query. The performance and execution plans of the queries in Figure 7.25 and Figure 7.26 are identical.

    xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo[
           addr/city = "Aurora"]/name;

Figure 7.26 XQuery that produces the same result as the query in Figure 7.25

If an XML index is defined on /customerinfo/addr/city, the queries in both Figure 7.25 and Figure 7.26 can use that index to speed up the evaluation of the predicate and to avoid a table scan. The query in Figure 7.27 also returns the same result but cannot use an XML index and is an example of how you should not write SQL/XML queries. It uses the XMLQUERY function in the WHERE clause to extract the city element and the XMLCAST function to cast that value to VARCHAR(20). Then an SQL equality predicate is applied to this value. This works but is not recommended because the usage of functions in the WHERE clause prohibits index usage and leads to a table scan. To express SQL/XML predicates, use XMLEXISTS instead.
    SELECT XMLQUERY('$INFO/customerinfo/name')
    FROM customer
    WHERE XMLCAST( XMLQUERY('$INFO/customerinfo/addr/city')
                   AS VARCHAR(20)) = 'Aurora';

Figure 7.27 XMLQUERY in the WHERE clause is typically not recommended!

Any of the XPath predicates discussed in Chapter 6 can be used inside XMLEXISTS to filter XML documents. This includes value predicates as well as structural predicates that check for the existence or non-existence of an element (see section 6.7). For example, the query in Figure 7.28
selects the id and name of all customers who have an assistant. The query in Figure 7.29 selects the id and name of the customers who do not.

    SELECT id, XMLQUERY('$INFO/customerinfo/name/text()')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[assistant]');

    1004 Matt Foreman

      1 record(s) selected.

Figure 7.28 A structural predicate checks for the existence of an element

    SELECT id, XMLQUERY('$INFO/customerinfo/name/text()')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[not(assistant)]');

    1003 Robert Shoemaker

      1 record(s) selected.

Figure 7.29 A structural predicate with negation

You can also use XMLEXISTS in conjunction with the XMLTABLE function, as shown in Figure 7.30. In this example, the XMLEXISTS predicate selects one row from the customer table, and only that one document is processed by the XMLTABLE function. Predicates in the column-generating expressions of the XMLTABLE function, such as [@type="home"] in Figure 7.30, do not affect the number of rows returned. This predicate only selects one out of multiple phone elements from each qualifying document. Hence, it is an intra-document predicate whose filtering effect is restricted to items within a single document. In contrast, the XMLEXISTS predicate filters rows from the entire table.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname VARCHAR(20) PATH 'name',
             phone    VARCHAR(12) PATH 'phone[@type="home"]') AS T
    WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

    Robert Shoemaker 416-555-2937

      1 record(s) selected.

Figure 7.30 Predicates in an XMLTABLE column expression do not filter rows
The query in Figure 7.31 produces the same result as the query in Figure 7.30. It does not use XMLEXISTS but an XPath predicate in the row-generating expression of the XMLTABLE function. Remember that XMLTABLE is a table function that produces one row for each item returned by the row-generating expression. Hence, a predicate in the row-generating expression eliminates rows just like a predicate in XMLEXISTS does. For consistency across all your queries, you might prefer to always use XMLEXISTS and not put predicates in the row-generating expression of an XMLTABLE function. There is no significant performance difference between the queries in Figure 7.30 and Figure 7.31.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo[@Cid = 1003]'
           COLUMNS
             custname VARCHAR(20) PATH 'name',
             phone    VARCHAR(12) PATH 'phone[@type="home"]') AS T;

Figure 7.31 Predicates in a row-generating expression filter rows

You can also use regular SQL predicates on the relational columns produced by the XMLTABLE function. In Figure 7.32, the XMLTABLE function returns the values of the /customerinfo/name elements as a VARCHAR(20) column called custname. This column is then used in the WHERE clause to restrict the result set to Robert Shoemaker.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custname VARCHAR(20) PATH 'name',
             phone    VARCHAR(12) PATH 'phone[@type="home"]') AS T
    WHERE custname = 'Robert Shoemaker';

    Robert Shoemaker 416-555-2937

      1 record(s) selected.

Figure 7.32 Using a relational predicate on a column generated by XMLTABLE

The query in Figure 7.32 is interesting because it applies a relational predicate to values that are extracted from the /customerinfo/name elements in the XML column. DB2 9 for z/OS and DB2 9.7 for Linux, UNIX, and Windows can exploit an XML index on /customerinfo/name to evaluate this relational predicate. This capability is not available in DB2 9.1 and DB2 9.5 for Linux, UNIX, and Windows.
If you use multiple search conditions, it is generally better to combine them into a single XMLEXISTS instead of using multiple XMLEXISTS. Figure 7.33 shows both options, which return the same result from our sample data.

    SELECT XMLQUERY('$INFO/customerinfo/name')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Aurora"]')
      AND XMLEXISTS('$INFO/customerinfo[addr/@country = "Canada"]');

    SELECT XMLQUERY('$INFO/customerinfo/name')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Aurora"
                                        and addr/@country = "Canada"]');

    <name>Robert Shoemaker</name>

      1 record(s) selected.

Figure 7.33 Using a single XMLEXISTS predicate is preferable

7.5 COMMON MISTAKES WITH SQL/XML PREDICATES

An easy mistake is to include the row-filtering predicate in the XMLQUERY function in the SELECT clause and not in XMLEXISTS in the WHERE clause (see Figure 7.34). This query always returns as many rows as there are rows in the customer table, which could be very many! This SQL statement does not include a WHERE clause and therefore never eliminates any rows. The XMLQUERY function therefore produces a result for every row in the table. For customers living in Aurora it returns their name; for all other customers it returns an empty sequence. If the customer table contained one customer in Aurora and 100,000 customers who do not live in Aurora, this query would return one name element plus 100,000 empty rows. This is not desirable.

    SELECT XMLQUERY('$INFO/customerinfo[addr/city = "Aurora"]/name')
    FROM customer;

    <name>Robert Shoemaker</name>      [first result row]
                                       [second result row (empty)]

      2 record(s) selected.

Figure 7.34 Predicates in the SELECT list do not filter rows!

An XPath predicate expressed in XMLEXISTS in the WHERE clause can filter rows. A predicate in the XMLQUERY function in the SELECT clause cannot filter rows. It can only restrict the output that is produced from each XML document.
This is further illustrated in Figure 7.35. The XMLEXISTS
predicate in the WHERE clause selects one row from the sample table; that is, the row with the document where the Cid attribute is 1003. The predicate [@type = "home"] in the XMLQUERY function does not affect the number of rows returned. It only ensures that the result row contains only the home phone number and not a list of all phone numbers of the selected customer.

    SELECT XMLQUERY('$INFO/customerinfo/phone[@type = "home"]')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

    <phone type="home">416-555-2937</phone>

      1 record(s) selected.

Figure 7.35 Only the XMLEXISTS predicate filters rows

Another easy mistake is to forget the square brackets in the XPath expression in the XMLEXISTS predicate (see Figure 7.36). This comes back to the same issue as was discussed for the query in Figure 6.35 in section 6.7. Without square brackets the XPath expression:

    /customerinfo/addr/city = "Aurora"

is a Boolean predicate of the form A = B and always evaluates to either true or false. It never evaluates to the empty sequence. The result is always a sequence that contains one item, and that item is either the value true or the value false. Remember that XMLEXISTS eliminates a row only if the XPath expression evaluates to the empty sequence. It truly performs an existence check. If the XPath expression evaluates to a non-empty sequence, such as the sequence with the single value true or false, XMLEXISTS does not eliminate the current row. Since the XPath expression in the XMLEXISTS predicate in Figure 7.36 never produces an empty sequence, no rows are ever eliminated and the result set contains as many rows as the customer table.

    SELECT XMLQUERY('$INFO/customerinfo/addr/city')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo/addr/city = "Aurora"');

    <city>Aurora</city>
    <city>Toronto</city>

      2 record(s) selected.

Figure 7.36 An XMLEXISTS predicate without square brackets is not useful!
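The two mistakes in this section come down to one rule: XMLEXISTS tests whether a sequence is empty, not whether a value is true. The following Python sketch is a toy model of that rule, not DB2 code. The two dictionaries are simplified stand-ins for the sample documents, and the function names are hypothetical helpers chosen for this illustration.

```python
# Toy model of the semantics described above; this is not DB2 code.
# A "document" is a plain dict and an XPath result is a Python list (a sequence).
customers = [
    {"name": "Robert Shoemaker", "city": "Aurora"},
    {"name": "Matt Foreman", "city": "Toronto"},
]


def xmlexists(sequence):
    """XMLEXISTS is true for any non-empty sequence, whatever it contains."""
    return len(sequence) > 0


def customerinfo_with_city(doc, city):
    """Models $INFO/customerinfo[addr/city = "..."]: one node or the empty sequence."""
    return [doc] if doc["city"] == city else []


def city_equals(doc, city):
    """Models $INFO/customerinfo/addr/city = "...": always exactly one Boolean item."""
    return [doc["city"] == city]


# Predicate in square brackets inside XMLEXISTS: non-matching rows are removed.
with_brackets = [d["name"] for d in customers
                 if xmlexists(customerinfo_with_city(d, "Aurora"))]

# No square brackets: the sequence (true) or (false) is never empty.
without_brackets = [d["name"] for d in customers
                    if xmlexists(city_equals(d, "Aurora"))]

# Predicate only in the select list (the XMLQUERY case): every row is kept,
# and non-matching documents contribute an empty result.
select_list_only = [[d["name"]] if d["city"] == "Aurora" else []
                    for d in customers]

print(with_brackets)     # ['Robert Shoemaker']
print(without_brackets)  # ['Robert Shoemaker', 'Matt Foreman']
print(select_list_only)  # [['Robert Shoemaker'], []]
```

The second and third results mirror the row counts in Figure 7.36 and Figure 7.34: in both cases no row is eliminated.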
7.6 USING PARAMETER MARKERS OR HOST VARIABLES

All the queries with predicates that we have discussed so far use literal values to select specific documents. But often it is preferable to use parameter markers or host variables instead. This allows you to prepare (compile) a query only once and pass a different literal value for each execution of the query. This avoids recompiling the query for each execution. Very short database queries often execute so fast that the time to compile and optimize them is a substantial portion of their total response time. This is where parameter markers or host variables provide a significant performance benefit.

Although you cannot use SQL-style parameter markers in XQuery, the SQL/XML functions XMLQUERY, XMLTABLE, and XMLEXISTS allow you to pass SQL parameter markers as variables into the embedded XQuery expression. This is recommended for applications with short and repetitive queries.

Figure 7.37 shows two SQL/XML queries that select rows from the customer table where the city element has a specific value. The queries use a parameter marker (?) and host variable, respectively, instead of a literal string. The passing clause assigns the parameter or host variable to the XPath variable c. In the XPath expression itself, this variable is used as $c. The dollar sign is used to reference the variable, similar to how $INFO references the XML column.

To ensure proper typing it is recommended to cast the parameter marker or host variable to an appropriate data type. Parameters that carry string values should always be cast to VARCHAR instead of CHAR; otherwise they are padded with blanks, which are included in the string comparison and lead to unexpected results. In XPath, trailing blanks are significant.

    SELECT info
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = $c]'
                    passing cast(? as VARCHAR(25)) AS "c");

    SELECT info
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = $c]'
                    passing cast(:hvar as VARCHAR(25)) AS "c");

Figure 7.37 XML predicates with parameter markers and host variables

The query in Figure 7.38 uses three parameter markers. One parameter appears in the XMLQUERY function to select which type of phone number to extract. The other two parameters are in the XMLEXISTS predicate to provide values that select customers in a specific city and country. Note that the passing clause can contain a comma-separated list of multiple input parameters. The same works with host variables.
    SELECT XMLQUERY('$INFO/customerinfo/phone[@type = $p1]'
                    passing cast(? AS VARCHAR(10)) AS "p1")
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = $p2
                                        and addr/@country = $p3]'
                    passing cast(? AS VARCHAR(25)) AS "p2",
                            cast(? AS VARCHAR(30)) AS "p3");

Figure 7.38 XML predicates with multiple parameter markers

The SELECT statement in Figure 7.39 shows that you can also pass parameter markers or host variables into the row-generating expression of the XMLTABLE function. Note that the SQL Standard does not allow a passing clause for the column-generating path expression in the XMLTABLE function.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo[@Cid = $c]'
           passing cast(:custid AS INTEGER) AS "c"
           COLUMNS
             custname VARCHAR(20) PATH 'name',
             phone    VARCHAR(12) PATH 'phone[@type = "home"]') AS T;

Figure 7.39 Using a host variable in the XMLTABLE function

If you also want to use a parameter to select the phone type, which is an intra-document predicate, you need to use an XMLQUERY function in the SELECT list (see Figure 7.40). The XMLCAST function casts the phone number to the same SQL type as the XMLTABLE function did in Figure 7.39.

    SELECT T.*,
           XMLCAST( XMLQUERY('$INFO/customerinfo/phone[@type = $t]'
                             passing cast(:type AS VARCHAR(10)) AS "t")
                    AS VARCHAR(20)) AS phone
    FROM customer,
         XMLTABLE('$INFO/customerinfo[@Cid = $c]'
           passing cast(:custid AS INTEGER) AS "c"
           COLUMNS
             custname VARCHAR(20) PATH 'name') AS T;

Figure 7.40 A row-filtering and an intra-document predicate with host variables
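The advice earlier in this section to cast string parameters to VARCHAR rather than CHAR can be made concrete with a small model. The Python sketch below is not DB2 code. The helper names as_char and as_varchar are hypothetical, and the sketch assumes the value fits within the declared length. It only shows why blank padding breaks an equality comparison when trailing blanks are significant.

```python
def as_char(value, length):
    """Model of a fixed-length CHAR(length) value: right-padded with blanks."""
    return value.ljust(length)


def as_varchar(value, length):
    """Model of VARCHAR(length): the value is passed through unchanged.

    Assumes len(value) <= length; truncation is not modelled here.
    """
    return value


city_in_document = "Aurora"

# Trailing blanks are significant, so the padded value does not match.
print(as_char("Aurora", 25) == city_in_document)     # False
print(as_varchar("Aurora", 25) == city_in_document)  # True
```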
7.7 XML QUERIES WITH DYNAMICALLY COMPUTED XPATH EXPRESSIONS

As you develop more and more sophisticated XML applications, you might encounter situations where it is useful to not have a fixed hard-coded XPath in your query, but to dynamically compute the XPath navigation steps at runtime. However, this is not immediately possible in SQL/XML. Also, you cannot provide a parameter marker or variable with a value such as “/customerinfo/name” to an SQL/XML query and use this variable value as a path for XML navigation.

Some degree of flexibility can be achieved with dynamically prepared SQL/XML statements in a stored procedure such as the one in Figure 7.41. The procedure takes two XPath expressions as input, one to extract information from XML documents, the other to filter the XML documents with a predicate. The stored procedure plugs these XPath expressions into an SQL/XML statement, which is then prepared and executed with a cursor. The procedure leaves the cursor open, which allows the caller, such as your Java application or another stored procedure, to iterate over the result set of the query.

    CREATE PROCEDURE dynXMLquery (IN XPathExtract VARCHAR(1024),
                                  IN XPathFilter  VARCHAR(1024) )
    LANGUAGE SQL
    BEGIN ATOMIC
      DECLARE sql VARCHAR(2048);
      DECLARE c1 CURSOR with return to caller FOR stmt;
      SET sql= 'SELECT XMLQUERY('' $INFO'|| XPathExtract ||
               ' '' ) FROM customer WHERE XMLEXISTS('' $INFO'|| XPathFilter ||
               ' '' )';
      PREPARE stmt FROM sql;
      OPEN c1 ;
    END #

Figure 7.41 Stored procedure to execute XPath dynamically

Since the body of a stored procedure can contain multiple statements, these statements have to be separated by the semicolon character. Therefore you cannot use the semicolon as the terminating character for the CREATE PROCEDURE statement. In this example we have chosen the # as the terminating character. If the procedure definition shown in Figure 7.41 is in a file create_proc.sql then the following command issued at the OS prompt creates the procedure:

    db2 -td# -f create_proc.sql
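To see what statement text the procedure in Figure 7.41 hands to PREPARE, it can help to replay the string concatenation outside the database. The Python sketch below is only a model of the SET sql assignment. The function name build_statement is hypothetical, the two XPath inputs are illustrative, and the exact blank spacing inside the literals is an assumption because it depends on how the figure was typeset.

```python
def build_statement(xpath_extract, xpath_filter):
    """Replay the SET sql = ... concatenation from Figure 7.41.

    A doubled quote ('') inside an SQL string literal stands for one quote
    character, so it appears as a single ' in the assembled text.
    """
    return ("SELECT XMLQUERY(' $INFO" + xpath_extract + " ' ) "
            + "FROM customer "
            + "WHERE XMLEXISTS(' $INFO" + xpath_filter + " ' )")


stmt = build_statement('/customerinfo/name',
                       '/customerinfo/addr[city="Aurora"]')
print(stmt)
# SELECT XMLQUERY(' $INFO/customerinfo/name ' ) FROM customer WHERE XMLEXISTS(' $INFO/customerinfo/addr[city="Aurora"] ' )
```

The assembled text is an ordinary SQL/XML statement in which both XPath expressions are fixed before compilation, which is why this approach works where a parameter marker for the path itself does not.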
The following call statement invokes the stored procedure and passes two XPath expressions. The XPath expressions are literal strings in this example, but they could also be passed as string variables computed by an application or another stored procedure.

    call dynXMLquery('/customerinfo/name',
                     '/customerinfo/addr[city="Aurora"]' )

7.8 ORDERING A QUERY RESULT SET BASED ON XML VALUES

One thing that distinguishes the XML data type from other SQL data types such as INTEGER, VARCHAR, or DATE is that you cannot perform any SQL comparisons on values of type XML. Sorting of XML type values in an SQL ORDER BY clause is also not possible because sorting involves comparisons. A value of type XML can be a complex nested XML document and there is no well-defined notion of equality, sort order, or collation among two or more XML documents. Therefore, the statements shown in Figure 7.42 are not meaningful and fail with errors. The error messages indicate that the XML data type cannot be used in SQL comparison or sort operations.

    SELECT id, info
    FROM customer
    ORDER BY info;

    SQL20353N An operation involving comparison cannot use operand
    "INFO" defined as data type "XML". SQLSTATE=42818

    SELECT id, info
    FROM customer
    WHERE XMLQUERY('$INFO/customerinfo/addr/city') = 'Aurora';

    SQL0401N The data types of the operands for the operation "="
    are not compatible or comparable. SQLSTATE=42818

    SELECT id, info
    FROM customer
    ORDER BY XMLQUERY('$i/customerinfo/name' passing info as "i");

    SQL20353N An operation involving comparison cannot use operand
    "Ordering column 1" defined as data type "XML". SQLSTATE=42818

Figure 7.42 Values of type XML cannot be compared or ordered

If you want to order a query result set by atomic values that are inside the XML documents in the XML column, you need to cast them to an SQL type first. The functions XMLCAST and XMLTABLE both perform conversion of XML values to SQL data types.
The SELECT statements in Figure 7.43 successfully order the result set by the value of the customer name element.
    SELECT id, info
    FROM customer
    ORDER BY XMLCAST( XMLQUERY('$i/customerinfo/name' passing info as "i")
                      AS VARCHAR(25)) ;

    SELECT id, info
    FROM customer,
         XMLTABLE('$INFO/customerinfo/name'
           COLUMNS
             custname VARCHAR(25) PATH '.')
    ORDER BY custname ;

Figure 7.43 Ordering a query result set based on converted XML values

Note that the XMLCAST function can only cast a single value at a time. If the first query in Figure 7.43 tried to order by /customerinfo/phone instead of /customerinfo/name, then the XMLQUERY function would return a sequence of multiple phone elements and hence XMLCAST fails. This is avoided by the use of XMLTABLE, which can iterate over repeating elements and cast them one at a time.

7.9 CONVERTING XML VALUES TO BINARY SQL TYPES

Casting a value to a specific data type is possible only if the value is in the value space that the data type represents. For example, casting the string “123” to data type INTEGER works because 123 is a valid integer number. However, the string “abc” cannot be cast to INTEGER because alphanumeric character strings are not in the value space of the type INTEGER. The same concept applies when you use the XMLTABLE or XMLCAST functions to cast values to BLOB or VARCHAR FOR BIT DATA. Some values can be cast to a binary type, others cannot.

The SQL/XML standard defines how values from XML documents are cast to SQL data types. It defines that a textual value in an XML document is first cast to an intermediate XQuery data type, and then to the target SQL data type. To discuss what that means, let’s assume that a document contains an element CCnumber, which contains binary data such as an encrypted credit card number.
Now consider the following XMLTABLE function:

    XMLTABLE('$INFO/customerinfo'
      COLUMNS
        Custid   INTEGER    PATH '@Cid',
        CCNumber BLOB(2048) PATH 'CCnumber' )

As per the SQL/XML standard, DB2 reads the textual value of the Cid attribute, casts this value to the appropriate intermediate XQuery type, xs:integer, and then from xs:integer to the SQL type INTEGER. The value space of the XQuery type xs:integer is larger than that of the SQL type INTEGER and includes integers of 18 digits in length. Thus, if a value can be cast to xs:integer, it does not automatically imply that the value can also be cast to SQL INTEGER.
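The two-step conversion of the Cid attribute can be sketched outside the database. The Python model below is not how DB2 implements the cast. It uses Python's unbounded int as a stand-in for the intermediate xs:integer type, which is an assumption of this sketch, and then applies the 32-bit range of the SQL type INTEGER as the second step. The function names are hypothetical.

```python
# Range of the SQL type INTEGER (32-bit signed).
SQL_INTEGER_MIN = -2**31
SQL_INTEGER_MAX = 2**31 - 1


def to_xs_integer(text):
    """Step 1: textual value -> intermediate integer.

    Python's int() is only a stand-in for xs:integer; it raises ValueError
    for text such as "abc" that is not in the integer value space.
    """
    return int(text.strip())


def to_sql_integer(text):
    """Step 2: intermediate integer -> SQL INTEGER, with a range check."""
    value = to_xs_integer(text)
    if not SQL_INTEGER_MIN <= value <= SQL_INTEGER_MAX:
        raise OverflowError(f"{value} is outside the SQL INTEGER range")
    return value


print(to_sql_integer("1003"))       # 1003
print(to_xs_integer("3000000000"))  # 3000000000 (step 1 succeeds)
try:
    to_sql_integer("3000000000")    # step 2 fails: larger than 2147483647
except OverflowError as err:
    print("cast failed:", err)
```

The value 3000000000 passes the first step but not the second, which is the situation the preceding paragraph describes.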
For the second column, DB2 reads the textual value of the element CCnumber, casts this value to the appropriate intermediate XQuery type (xs:base64Binary), and then from xs:base64Binary to the SQL type BLOB(2048). The value space of xs:base64Binary is the set of finite-length chains of binary octets. This means that a string can be cast to xs:base64Binary only if its length is a multiple of four characters and it does not contain any characters other than a-z, A-Z, 0-9, +, /, and =. Thus, if a cast to a binary SQL data type fails it’s likely because the original value is not a valid binary XML value. For example, the ASCII string “ABCD1234” can be cast to binary, but “ABCD12345” cannot because it has nine characters.

7.10 SUMMARY

SQL/XML enables you to embed XPath and XQuery expressions in SQL statements to query XML and relational data in an integrated manner. Let’s quickly recapitulate the three most important functions for writing SQL/XML queries.

XMLQUERY is an SQL scalar function that is typically used in the SELECT clause of an SQL query. It takes an XPath or XQuery expression as well as a reference to an XML column as input. For each row, the expression in the XMLQUERY function is applied to the XML document in that row and a single value of type XML is returned. This value is a sequence of zero, one, or multiple items. These items can be, for example, XML elements or atomic values such as numbers or strings. An XPath expression in an XMLQUERY function can contain predicates, but these predicates are only applied within any given document. The XMLQUERY function processes one document at a time and does not perform operations across multiple documents in multiple rows of the table.

XMLEXISTS is a predicate that is commonly used in the WHERE clause of an SQL statement to express filtering conditions on an XML column.
Like the XMLQUERY function, it takes an XPath expression as well as a reference to an XML column as input, and is applied to one XML document at a time. The XMLEXISTS predicate returns false and removes the current row from the result set if the embedded XPath expression returns the empty sequence. The embedded XPath expression should always include an XPath predicate that must be enclosed in square brackets.

XMLTABLE is a table function that is used in the FROM clause of an SQL statement. It reads one or multiple values from an XML document and returns them as a set of relational rows. The XMLTABLE function contains multiple XPath expressions; that is, one row-generating XPath expression and one or multiple column-generating XPath expressions. The XMLTABLE function generates one relational row for each XML element or attribute produced by the row-generating expression. Since XML documents can contain optional as well as repeating elements, the XMLTABLE function may produce zero, one, or multiple relational rows for each input document. The column-generating expressions compute the values that are returned in each row. These expressions are relative path expressions, based on the nodes identified by the row-generating expression.

You will see further examples of these functions in the following chapters.
CHAPTER 8 Querying XML Data with XQuery

This chapter takes the discussion of XML queries to the next level and builds upon the previous two chapters. Chapter 6, Querying XML Data: Introduction and XPath, introduced the XPath and XQuery data model and described the XPath language. Chapter 7, Querying XML Data with SQL/XML, demonstrated how SQL/XML allows you to embed XPath expressions in SQL queries. Now we turn to XQuery, which is a query language for XML data and a superset of XPath. Everything you have learned about XPath already counts towards your understanding of XQuery. XQuery and XPath use a common data model, which was introduced in section 6.2, Understanding the XQuery and XPath Data Model. Understanding this data model is very helpful for understanding XQuery.

In this chapter we introduce XQuery through a series of examples with focus on gaining a quick understanding and practical usage. This chapter is not meant to be a complete and formal XQuery language reference. Appendix C, Further Reading, contains suggestions for further reading about XQuery. XQuery is supported in DB2 for Linux, UNIX, and Windows, but not in DB2 9 for z/OS.

Many examples in this chapter use the same customerinfo sample documents that we used throughout Chapters 6 and 7 (see Figure 6.7 in section 6.3, Sample Data for XPath, SQL/XML, and XQuery). Section 8.5 then also introduces purchaseorder sample data, which is used to illustrate the features presented in the second half of this chapter.

This chapter discusses the following topics:

• Overview of XQuery expressions (section 8.1)
• The XQuery FLWOR expression (section 8.2)
• Differences and similarities between XPath and FLWOR expression and SQL/XML statements (section 8.3)
• Constructing new XML documents with XQuery constructor expressions (section 8.4)
• XQuery data types, arithmetic expressions, and functions (sections 8.5 through 8.7)
• Using SQL queries and SQL functions within XQuery (sections 8.8 through 8.9)

8.1 XQUERY OVERVIEW

XQuery is a functional language. It consists of several different kinds of expressions, which can be combined to compose more sophisticated expressions. Since XPath is a subset of XQuery, you have already seen some of those expressions in the previous chapters. Some of the most important XQuery expressions include:

• Path expressions: Path expressions are used to locate nodes, such as XML elements and attributes, in the tree structure of an XML document. XPath expressions were introduced in Chapter 6 and they continue to play an important role in XQuery.
• FLWOR expressions: FLWOR expressions allow you to iterate over the items in a sequence to bind variables to intermediate query results. Such expressions are useful for combining data from multiple XML documents or different parts of a single document. The name FLWOR, pronounced “flower,” is based on the keywords for, let, where, order by, and return. XQuery is a case-sensitive language and all keywords must be written in lowercase. FLWOR expressions are discussed in sections 8.2 and 8.3.
• Constructor expressions: XQuery constructors can be used to create XML nodes, such as elements and attributes, so that you can build new XML documents within a query. This is explained in section 8.4.
• Cast expressions: A cast expression converts a value to a different data type. Section 8.5 provides details on XQuery data types, cast expressions, and potential type errors.
• Arithmetic expressions: XQuery has arithmetic operators for addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulus (mod). See section 8.6 for details.
• Comparison, logical, and conditional expressions: These expressions allow you to formulate predicates to search for specific information. You have seen many of these expressions in Chapter 6, especially in sections 6.7, 6.8, and 6.14. Conditional expressions are if-then-else expressions and already occurred in Chapter 7 in sections 7.3.3 and 7.3.6.
• Sequence expressions: With sequence expressions you can construct or combine sequences. The construction and union of sequences was discussed in section 6.12.
• Transform expressions: Transform expressions allow you to update or transform existing XML documents. This is covered in Chapter 12, Updating and Transforming XML Documents.
8.2 PROCESSING XML DATA WITH FLWOR EXPRESSIONS

The FLWOR expression is one of the most powerful and commonly used expressions in the XQuery language. It is comparable to the SELECT-FROM-WHERE statement in the SQL language. A significant difference is that an SQL SELECT statement operates on relational data (sets of tuples) whereas the XQuery FLWOR expression operates on XML data. XML data is more formally described by the XQuery Data Model as sequences of atomic values and nodes, such as XML element and attribute nodes (see section 6.2, Understanding the XQuery and XPath Data Model).

8.2.1 Anatomy of a FLWOR Expression

Let’s first look at the generic syntax of the XQuery FLWOR expression (see Figure 8.1) and then walk through concrete examples. In Figure 8.1, the DB2 keyword xquery indicates that this is stand-alone XQuery and not an SQL statement. The body of the query contains the XQuery keywords for, let, where, order by, and return, which give the FLWOR expression its name. The second line of the query is the for clause. It consists of the keyword for, a variable, the keyword in, and an expression such as a path expression. The for clause iterates over the sequence of items that is produced by expression1. The let clause does not iterate but assigns the entire sequence produced by expression2 to the variable $variable2. Similar to a SELECT statement in SQL, the where and order by clauses filter and sort the result set, which is then returned (projected) in the return clause. A FLWOR expression must contain at least one for clause or at least one let clause, and must contain a return clause. The where and order by clauses are optional.

    xquery
    for $variable1 in expression1
    let $variable2 := expression2
    where expression3
    order by expression4 [ascending|descending]
    return expression5

Figure 8.1 The syntax of the XQuery FLWOR expression

Figure 8.2 shows a more concrete example of a FLWOR expression.
It returns the phone elements of all customers who live in Canada. The for clause contains a path expression that you are already familiar with from Chapter 6. This path expression produces a sequence of customerinfo elements from the documents in the INFO column of the CUSTOMER table. The for clause iterates over this sequence. In each iteration the variable $c is assigned the next item (customerinfo element) in the sequence. The third line of the query contains the let clause. It assigns the result of the path expression $c/phone to the variable $p. Since $c holds the customerinfo element of the current iteration, $c/phone produces the sequence of phone elements for that customer. That entire sequence, which can contain multiple phone elements, is assigned to $p.
    xquery
    for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
    let $p := $c/phone
    where $c/addr/@country = "Canada"
    order by $c/@Cid descending
    return $p;

Figure 8.2 The XQuery FLWOR expression

Next, the where clause evaluates the predicate $c/addr/@country = "Canada". This predicate evaluates to true if a country attribute exists that has the value Canada. If the predicate evaluates to false, then the item of the current iteration is no longer considered in this query. Unlike XMLEXISTS, the predicate in a where clause of a FLWOR expression does not require square brackets. This is because the where clause checks whether the predicate evaluates to true or false, while XMLEXISTS checks for the existence of any value or node. You can add square brackets to the predicate in the where clause without changing the result of the query, for example where $c/addr[@country = "Canada"]. The expression $c/addr[@country = "Canada"] either evaluates to the empty sequence, in which case the predicate is false, or to a sequence of one or multiple addr elements, in which case the predicate is true. Regardless of the use of square brackets, the performance of this query can benefit from an XML index on /customerinfo/addr/@country. Also, note that in general the where clause can contain multiple predicates combined with and and or.

For each item that meets the condition in the where clause, the return clause is evaluated to produce output (results). In Figure 8.2, the expression in the return clause is simply $p, which holds the sequence of a customer’s phone elements. The order by clause reorders the items in the iteration and causes the phone numbers of the customer with the largest value in the Cid attribute to be returned first. The phone elements of any individual customer are returned in the order in which they appear in the document. The result is shown in Figure 8.3.
    <phone type="work">905-555-4789</phone>
    <phone type="home">416-555-3376</phone>
    <phone type="work">905-555-7258</phone>
    <phone type="home">416-555-2937</phone>
    <phone type="cell">905-555-8743</phone>

Figure 8.3 Result of the queries in Figure 8.2 and Figure 8.4

A DB2 client, such as the DB2 Command Line Processor or a JDBC application, receives this sequence of phone elements as a result set consisting of five rows in a single column of type XML. An application can iterate over these rows just like it normally does for relational result sets. The same query result can be produced by an SQL/XML statement, such as the one in Figure 8.4. It, too, produces a single column of type XML.
    SELECT T.phone
    FROM customer,
         XMLTABLE('$INFO/customerinfo/phone'
           COLUMNS
             custID INTEGER PATH '../@Cid',
             phone  XML     PATH '.') AS T
    WHERE XMLEXISTS('$INFO/customerinfo[addr/@country = "Canada"]')
    ORDER BY T.custID DESC;

Figure 8.4 SQL/XML query that also produces the result in Figure 8.3

There are some analogies between FLWOR expressions and SQL SELECT statements. Both can have an optional WHERE clause for filtering and an optional ORDER BY clause for sorting the result. The projection and the format of the result are defined by the SELECT clause in SQL and by the return clause in XQuery. The let and for clauses in XQuery roughly correspond to the FROM clause in SQL, defining the source of the data.

8.2.2 Understanding the for and let Clauses

Every FLWOR expression has to have at least one for or let clause and must have a return clause. The for and let clauses introduce new variables, which can subsequently be referred to in other clauses of the FLWOR expression. However, for and let assign values to variables in different ways.

Figure 8.5 highlights the difference between the for and the let clause. This example is purposefully simplified to clearly reveal important concepts. It does not use a sequence of XML values from an XML column, but it constructs a sequence containing the atomic values 1, 5, and 3. The first of the two FLWOR expressions in Figure 8.5 uses a for clause to iterate over these three items. In the first iteration, the variable $i assumes the value 1, in the second iteration the value 5, and in the third iteration the value 3. In each iteration, the return clause constructs a new element called result whose value is the value of the variable $i. The query returns three result rows, one for each item in the input sequence.

The second query in Figure 8.5 uses a let clause. Contrary to the for clause, it does not iterate over the items in the sequence.
Instead, it assigns the entire sequence to the variable $j. The return clause then returns the entire sequence enclosed in the newly constructed result element. You can certainly choose a different name for that element, if you like.

db2 => xquery for $i in (1,5,3) return <result>{$i}</result>;

<result>1</result>
<result>5</result>
<result>3</result>

  3 record(s) selected.

db2 => xquery let $j := (1,5,3) return <result>{$j}</result>;

<result>1 5 3</result>

  1 record(s) selected.

Figure 8.5 The difference between for and let

8.2.3 Understanding the where and order by Clauses

Figure 8.6 shows two more versions of the previous query with the for clause. The first version has an additional where clause to restrict the result set to values greater than 2. The second query in Figure 8.6 adds an order by clause to return the result items in ascending order. Both the where and the order by clause use the variable $i that is introduced in the for clause.

db2 => xquery for $i in (1,5,3) where $i > 2 return <result>{$i}</result>;

<result>5</result>
<result>3</result>

  2 record(s) selected.

db2 => xquery for $i in (1,5,3) where $i > 2 order by $i return <result>{$i}</result>;

<result>3</result>
<result>5</result>

  2 record(s) selected.

Figure 8.6 The effect of the where and order by clauses
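
As a further illustration of the order by clause, the following sketch reverses the sort order of the second query in Figure 8.6. It is not one of the original figures; it assumes the descending modifier that the XQuery standard defines for order by.

```xquery
xquery
(: Same input sequence and filter as in Figure 8.6, sorted from high to low :)
for $i in (1,5,3)
where $i > 2
order by $i descending
return <result>{$i}</result>;
```

With this modifier the result element for the value 5 should be returned before the one for the value 3.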

8.2.4 FLWOR Expressions with Multiple for and let Clauses

An XQuery FLWOR expression can contain multiple for or let clauses. Figure 8.7 shows two nested for clauses that act similarly to nested loops in a programming language. The outer for clause iterates over the sequence (1,5,3) and the inner for iterates over the sequence ("a","b"). For each iteration of the outer for clause, the inner for clause iterates over all the items in its sequence. This generates the full Cartesian product between the input sequences. An analogy in the SQL world is a SELECT statement with two tables in the FROM clause and no join predicate.

db2 => xquery for $i in (1,5,3) for $j in ("a","b") return <result>{$i,$j}</result>;

<result>1 a</result>
<result>1 b</result>
<result>5 a</result>
<result>5 b</result>
<result>3 a</result>
<result>3 b</result>

  6 record(s) selected.

Figure 8.7 Two nested for clauses produce a Cartesian product

The XQuery in Figure 8.8 also contains two nested for clauses. Their input sequences contain a common item, the atomic value 5, which is identified by a join predicate in the where clause. This is analogous to an SQL join. The difference is that SQL operates on sets of relational rows while XQuery operates on sequences of items. In these examples the items are just atomic values to allow for an easy introduction of the language. In the following sections we return to the customer sample data where the items are XML nodes, including elements, attributes, and full documents.

db2 => xquery for $i in (1,5,3) for $j in (7,5) where $i = $j return <result>{$i,$j}</result>;

<result>5 5</result>

  1 record(s) selected.

Figure 8.8 Two nested for clauses with a join predicate

Since the XQuery let clause does not iterate, it does not contribute to the generation of a Cartesian product of sequences. For example, the query in Figure 8.9 contains a for clause and two let clauses.
Each iteration of the for clause leads to one item in the query result. The return
clause constructs result elements. The value of each result element is the sequence of the values of the variables $i, $j, and $k.

db2 => xquery for $i in (1,5,3) let $j := ("a","b") let $k := $i * 2 return <result>{$i,$j,$k}</result>;

<result>1 a b 2</result>
<result>5 a b 10</result>
<result>3 a b 6</result>

  3 record(s) selected.

Figure 8.9 A FLWOR expression with for and let clauses

All variable names in XQuery have to be preceded by the dollar sign ($). The XQuery standard allows one or multiple spaces between the dollar sign and the beginning of the actual variable name, so that both $var and $ var are valid variable names. However, for readability and to avoid confusion it’s best to not use spaces. The same applies to hyphens. Note that $a-b and $ a-b are valid variable names that happen to contain a hyphen. But a - b is interpreted as an arithmetic operation because there are spaces between the hyphen and the characters a and b.

LEARNING XQUERY
When it comes to learning a new language there is no better way than learning by doing. We suggest that you download and install the latest version of DB2 Express-C, which is free, so that you can run the XQuery examples in this section hands-on. The examples show that you can explore the behavior of XQuery even without any tables in the database. We encourage you to extend and modify these examples and to try other combinations of for, let, where, order by, and return clauses. You may find that XQuery becomes intuitive quite quickly.

8.3 COMPARING FLWOR EXPRESSIONS, XPATH EXPRESSIONS, AND SQL/XML

This section compares and examines XPath, FLWOR, and SQL/XML queries in several ways. We look at traversing XML documents to extract specific elements, coding and placing XML predicates, result set cardinalities, and the integration of FLWOR expressions in SQL statements. We discuss several examples of how “the same” query can be written in several different ways.
By “the same” we mean that the same result is returned from the sample data. The examples are not exhaustive; that is, they do not show all possible ways in which a certain query can be written.

8.3.1 Traversing XML Documents

Figure 8.10 illustrates five different ways to retrieve the customer name elements. There is no significant performance difference between them, but for readability and maintainability it is a good idea to use as simple a syntax as possible to express a query. Hence, options (4) and (5) are good choices in Figure 8.10.

The first FLWOR expression in Figure 8.10 iterates over the customerinfo elements and binds them to the variable $c, one at a time. The return clause then uses $c as the context to navigate to the name element.

The second FLWOR expression iterates directly over the name elements and binds them to the variable $n, one at a time. The return clause then only emits the values of $n. The navigation to the name element has shifted from the return clause to the for clause.

The third FLWOR expression iterates over the customer documents; that is, over the document nodes that are at the top of each document tree. The return clause then navigates from these document nodes, represented by $i, to the customerinfo/name elements. You will see shortly that the decision of what to iterate over in the for clause makes a difference as soon as you add predicates to the query.

The fourth expression is a simple XPath that returns the sequence of all name elements. The fifth query is an SQL/XML statement that uses the XMLQUERY function to extract the name elements.

--(1)
xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $c/name;

--(2)
xquery
for $n in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
return $n;

--(3)
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
return $i/customerinfo/name;

--(4)
xquery
db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name;

--(5)
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer;

Figure 8.10 Five different ways to retrieve the customer name elements
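
One more formulation, sketched here as an illustration rather than taken from Figure 8.10, uses a let clause instead of a for clause. Because let binds the whole sequence of name elements to a single variable, the return clause is evaluated only once. The variable name $names is an arbitrary choice.

```xquery
xquery
(: let binds the entire sequence of name elements to $names in one step :)
let $names := db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
return $names;
```

The items of the returned sequence should still arrive as separate result rows, as they do for the plain XPath in option (4).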

8.3.2 Using XML Predicates

Figure 8.11 extends the sample queries of Figure 8.10 by adding a predicate to only return the name of the customer whose Cid attribute has the value 1003. All five queries return the same result.

Again, the first two FLWOR expressions in Figure 8.11 differ in whether the step to the name element happens in the for or the return clause. This difference affects the where clause, which uses the variable from the for clause. If the for clause assigns the variable $i to customerinfo elements, then the where clause can simply use the XPath $i/@Cid to access the Cid attribute. This is because Cid is an attribute of customerinfo. The second FLWOR expression, however, binds the variable $i to name elements. This forces the where clause to use a parent step to navigate from $i to the Cid attribute. This is an extra navigation step, which makes the second FLWOR expression slightly more expensive.

The third FLWOR expression shows that filtering predicates can be located not only in the where clause but also in the XPath expression of the for clause. In fact, the entire query can again be expressed as a single XPath, which is the fourth query. And finally, the fifth query is an SQL/XML statement, which uses the XMLEXISTS predicate to properly include the filtering condition.

--(1)
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return $i/name;

--(2)
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
where $i/../@Cid = 1003
return $i;

--(3)
xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]
return $c/name;

--(4)
xquery
db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]/name;

--(5)
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

Figure 8.11 Five different ways to apply a predicate
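
The choice between the where clause and an XPath predicate also applies when several conditions are combined. The following sketch is an illustration, not one of the original figures. It adds a second condition on the country attribute of the addr element, which appears in other queries of this chapter, and writes the pair of conditions in both places.

```xquery
xquery
(: Both conditions in the where clause, combined with "and" :)
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003 and $i/addr/@country = "Canada"
return $i/name;

xquery
(: The same two conditions as a single XPath predicate :)
db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
    [@Cid = 1003 and addr/@country = "Canada"]/name;
```

For the sample data, where customer 1003 is among the Canadian customers, both statements should return the same name element as the queries in Figure 8.11.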

The next example (Figure 8.12) shows four different queries that return phone elements whose attribute type has the value cell. The first FLWOR expression uses two nested for clauses. The outer for clause iterates over the customerinfo elements and assigns them to the variable $c. The inner for clause uses the path $c/phone to iterate over the phone elements of the current customer. For each such phone element, the where clause checks whether the type attribute has the value cell. If so, the return clause returns that phone element.

The second FLWOR expression shows that the same query result can be achieved without nested for clauses. It uses only a single for clause to iterate directly over the phone elements. The predicate could be applied in the where clause, but this query adds the predicate to the return clause. You will see later that predicates in the return clause can lead to different query results if element construction is involved.

The third query is a simple XPath without any FLWOR clauses. The last query is an SQL/XML statement that uses the XMLTABLE function to produce one result row per cell phone, just like the other queries.

--(1)
xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
for $p in $c/phone
where $p/@type = "cell"
return $p;

--(2)
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
return $i[@type = "cell"];

--(3)
xquery
db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone[@type="cell"];

--(4)
SELECT T.phone
FROM customer,
     XMLTABLE('$INFO/customerinfo/phone[@type="cell"]'
       COLUMNS phone XML PATH '.') AS T;

Figure 8.12 Four different queries that return the same phone elements

NOTE
An advantage of SQL/XML queries is that they can contain parameter markers and host variables in their predicates, as discussed in section 7.6. This is not possible when you use XQuery without SQL.
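
Returning to the second query of Figure 8.12: its predicate can equally be placed in the where clause, as the discussion mentions. The following sketch of that variant is an illustration and not part of the original figure.

```xquery
xquery
(: Single for clause over the phone elements; the filter moves from
   the return clause into the where clause :)
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
where $i/@type = "cell"
return $i;
```

As long as no elements are constructed in the return clause, this form should return the same phone elements as the other queries in Figure 8.12.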

8.3.3 Result Set Cardinalities in XQuery and SQL/XML

Let’s look at result set cardinalities using the three queries in Figure 8.13 as examples. Each of the three queries returns all five customer phone numbers, three from one of our sample documents and two from the other. The first query is an XPath expression that produces a sequence of five text nodes, and each item in that sequence is returned as a separate result row.

The second query uses the XMLQUERY function and returns the same five phone numbers in two result rows. The reason is that XMLQUERY is a scalar function in an SQL statement, and scalar functions produce one value for each input row. In our example there are two input rows (documents) and for each of them XMLQUERY produces one sequence of phone numbers.

You can turn the items in these sequences into separate rows only if you use a table function (as opposed to a scalar function), which generates a set of rows. This is what the XMLTABLE function does.

xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone/text();

905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376

  5 record(s) selected.

SELECT XMLQUERY('$INFO/customerinfo/phone/text()')
FROM customer;

905-555-7258416-555-2937905-555-8743
905-555-4789416-555-3376

  2 record(s) selected.

SELECT T.phone
FROM customer,
     XMLTABLE('$INFO/customerinfo/phone'
       COLUMNS phone VARCHAR(20) PATH '.') AS T;

905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376

  5 record(s) selected.

Figure 8.13 Three different queries that return the same five phone numbers
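
The same cardinality distinction can be observed within XQuery itself. The following sketch is an illustration, not one of the original figures. It uses the standard function fn:count once over the whole sequence of phone elements and once per customer; the element name phonecount is a hypothetical name chosen for this example.

```xquery
xquery
(: One item in total: the number of phone elements across all documents :)
fn:count(db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone);

xquery
(: One item per customerinfo element: the for clause iterates per customer :)
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <phonecount Cid="{$c/@Cid}">{fn:count($c/phone)}</phonecount>;
```

For the two sample documents, the first statement should return a single row with the value 5, while the second should return two rows, one per customer, much like the XMLQUERY example in Figure 8.13 returns one row per document.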

A key difference between XPath or XQuery expressions on the one hand and SQL/XML statements on the other is that XPath and XQuery expressions always return a single column of type XML. XQuery cannot return multiple columns in a result set or data types other than XML. SQL/XML statements can read values from XML documents and return them as relational result sets that have multiple columns and traditional SQL data types (see section 7.3, Retrieving XML Values in Relational Format with XMLTABLE).

NOTE
The examples in this section have shown that many simple queries do not require XQuery FLWOR expressions but can be written much more simply as plain XPath expressions. Indeed, many applications are well-served by combining XPath and SQL and do not necessarily require the extra power of XQuery. However, XQuery has very valuable features that XPath alone does not provide. For example, construction of XML data and joins across multiple XML documents are not possible with XPath alone. Section 8.4 and Chapter 9, Querying XML Data: Advanced Queries and Troubleshooting, provide examples.

8.3.4 Using FLWOR Expressions in SQL/XML

Note that SQL/XML and XQuery are not mutually exclusive. Chapter 7 focused on examples that combine XPath and SQL, which is supported both in DB2 for Linux, UNIX, and Windows and DB2 for z/OS. In DB2 for Linux, UNIX, and Windows, the same SQL/XML functions can also take more complex XQuery expressions as input, such as FLWOR expressions. Figure 8.14 shows an example. It returns the name of the customer whose Cid attribute has the value 1003. Remember that the XMLEXISTS predicate is truly an existence check. If the XQuery or XPath expression in the XMLEXISTS returns an empty sequence, then XMLEXISTS evaluates to FALSE and the current row is eliminated.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 return $i/text()')
FROM customer
WHERE XMLEXISTS('let $i := $INFO/customerinfo
                 where $i/@Cid = 1003
                 return $i');

Figure 8.14 Return the name of the customer whose Cid is 1003

If the same result can be achieved with simple XPath then for simplicity it is recommended to avoid FLWOR expressions in SQL/XML functions. For example, the query in Figure 8.15 is simpler than the query in Figure 8.14 and returns an identical result set.

SELECT XMLQUERY('$INFO/customerinfo/name/text()')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

Figure 8.15 A simpler query to return the same result as Figure 8.14

Figure 8.16 provides an example of how you should not integrate XQuery in SQL. The problem with this query is that the predicate on the Cid attribute is included in the FLWOR expression in the SELECT clause of the SQL statement. In this location, the predicate does not eliminate any rows from the customer table. To work as expected, the predicate needs to be in the WHERE clause of the SQL statement, using XMLEXISTS. This issue has been discussed in section 7.5, Common Mistakes with SQL/XML Predicates.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 where $i/@Cid = 1003
                 return $i/text()')
FROM customer;

Figure 8.16 Do not place row-filtering predicates in the SELECT clause!

8.4 CONSTRUCTING XML DATA

Constructing XML data in XQuery is easy. You can simply type regular XML tags as part of your XQuery. This method is called direct XML construction. For example, an XML element or document just by itself is already a valid XQuery expression. Figure 8.17 is a simple example where the XQuery consists of nothing but a direct element constructor. The name of the constructed element is title and its value is the literal string Hello. The result of the XQuery is the constructed element itself. This cannot be done with XPath alone.

db2 => xquery <title>Hello</title>;

<title>Hello</title>

  1 record(s) selected.

db2 =>

Figure 8.17 Constructing the element title with the value "Hello"

8.4.1 Constructing Elements with Computed Values

It is often desirable to generate XML elements whose values are dynamically computed during query execution. Constructed elements can have computed values if they contain XQuery variables or other dynamic expressions. Such expressions must be enclosed in curly brackets and are
often used in the return clause of a FLWOR expression. For example, the query in Figure 8.18 retrieves the name and city values, and returns this information in a newly constructed XML document.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return <quickinfo>
         <custname>{$i/name/text()}</custname>
         <custcity>{$i/addr/city/text()}</custcity>
       </quickinfo>;

<quickinfo><custname>Robert Shoemaker</custname><custcity>Aurora</custcity></quickinfo>

  1 record(s) selected.

Figure 8.18 Construction of an XML document with dynamic values

Several things are noteworthy about Figure 8.18. The returned XML data uses XML element names that do not exist in the XML documents that are stored in the table. In other words, the query reads one XML format but returns another. This performs a transformation of the data. Although XQuery is not always a substitute for XSLT (Extensible Stylesheet Language Transformations), it can carry out many transformations easily and efficiently.

In contrast to Figure 8.17, the values of the constructed elements in Figure 8.18 are not provided as literal strings but computed by XPath expressions. These XPath expressions must be enclosed in curly brackets to indicate that they are to be evaluated and not used as literal string values. If you forget the curly brackets, the query result contains the actual path expressions, which is not useful:

<quickinfo><custname>$i/name/text()</custname><custcity>$i/addr/city/text()</custcity></quickinfo>

The XPath expressions within the constructed elements use /text() as the last step in the path. This way they only retrieve the text node value of the name and city elements, but not the elements themselves. If you do not use /text() then the original XML elements name and city are included in the constructed XML document.
This behavior is demonstrated by the query in Figure 8.19, which only constructs the quickinfo element and inserts the existing elements name and city into it.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return <quickinfo>
         {$i/name}{$i/addr/city}
       </quickinfo>;

<quickinfo><name>Robert Shoemaker</name><city>Aurora</city></quickinfo>

  1 record(s) selected.

Figure 8.19 Expressions that include existing elements in constructed elements

You can produce the same query result with the SQL/XML statement in Figure 8.20. Since direct element constructors are XQuery expressions, and any XQuery expression can be embedded in SQL/XML, simply use direct element construction in the XMLQUERY function as needed. Adjust the XPath expressions to use the INFO column as the starting point (context) for navigation. One benefit of the SQL/XML query is that you can now use parameter markers or host variables in the XMLEXISTS predicate, if desired.

SELECT XMLQUERY('<quickinfo>
                   {$INFO/customerinfo/name}
                   {$INFO/customerinfo/addr/city}
                 </quickinfo>')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

Figure 8.20 Direct element constructors embedded in SQL/XML

8.4.2 Constructing XML Data with Predicates and Conditions

Both the content and the tag names of constructed elements can be controlled with predicates and conditional expressions (if-then-else). Let’s look at some examples.

In section 8.3 we compared various queries by moving navigation steps and predicates from one clause to another. For example, the second FLWOR expression in Figure 8.12 iterates over phone elements and applies a predicate in the return clause. This has a significant effect if element construction comes into play, as in Figure 8.21. Since there is no predicate in the for or where clauses, the query constructs a result document for every item of the iteration; that is, for every phone element. If you prefer to produce one result per customer, then the for clause should iterate over customerinfo, not over phone.
Second, the predicate that selects cell phones is within the constructed element cellphone. This element is constructed regardless of the evaluation of the predicate. Hence, the query result contains empty cellphone elements for every phone number that’s not a cell phone.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
return <quickinfo>
         {$i/../name}
         <cellphone>{$i[@type="cell"]/text()}</cellphone>
       </quickinfo>;

----------------------
<quickinfo><name>Robert Shoemaker</name><cellphone/></quickinfo>
<quickinfo><name>Robert Shoemaker</name><cellphone/></quickinfo>
<quickinfo><name>Robert Shoemaker</name><cellphone>905-555-8743</cellphone></quickinfo>
<quickinfo><name>Matt Foreman</name><cellphone/></quickinfo>
<quickinfo><name>Matt Foreman</name><cellphone/></quickinfo>

  5 record(s) selected.

Figure 8.21 The effect of predicates within element constructors

The if-then-else expression in XQuery allows you to generate XML tags conditionally based on value predicates. For each phone element, the query in Figure 8.22 creates an info element that contains the customer Cid attribute as well as another element with the phone number. The name of this element depends on the value of the type attribute of the original phone element. If the type is cell, the constructed element is called cellphone. If the type is work, the constructed element is workphone, and so on. The nesting of the if-then-else expressions is necessary because XQuery does not have an elseif or case construct.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
return <info>
         {$i/../@Cid}
         {if ($i/@type="cell")
          then <cellphone>{$i/text()}</cellphone>
          else if ($i/@type="work")
               then <workphone>{$i/text()}</workphone>
               else if ($i/@type="home")
                    then <homephone>{$i/text()}</homephone>
                    else <phone>{$i/text()}</phone>
         }
       </info>;

<info Cid="1003"><workphone>905-555-7258</workphone></info>
<info Cid="1003"><homephone>416-555-2937</homephone></info>
<info Cid="1003"><cellphone>905-555-8743</cellphone></info>
<info Cid="1004"><workphone>905-555-4789</workphone></info>
<info Cid="1004"><homephone>416-555-3376</homephone></info>

  5 record(s) selected.

Figure 8.22 Conditional construction of XML elements
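
Returning to Figure 8.21, the two issues described there can be addressed together. The following sketch is one possible rewrite, not a query from the original text. It iterates over customerinfo to produce one quickinfo element per customer, and it moves the cell phone predicate into a nested for clause so that a cellphone element is constructed only when a matching phone exists.

```xquery
xquery
(: One quickinfo element per customer, not one per phone element :)
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <quickinfo>
         {$c/name}
         {(: the constructor below runs only for phones of type "cell" :)
          for $p in $c/phone[@type = "cell"]
          return <cellphone>{$p/text()}</cellphone>}
       </quickinfo>;
```

For the sample data this should yield two quickinfo elements, and the one for a customer without a cell phone should contain only the name element instead of an empty cellphone element.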

In Figure 8.22, note that the expression {$i/../@Cid} produces an attribute node. This attribute automatically becomes an attribute of the parent element (info). Within an element constructor such as <info></info>, it is mandatory that any such attribute nodes appear before any element nodes. The query in Figure 8.23 fails because the attribute expression {$i/@Cid} appears after the element expression {$i/name}. Reverse the two and the query works fine.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <info>{$i/name}{$i/@Cid}</info>;

SQL16015N An element constructor contains an attribute node named "Cid" that follows an XQuery node that is not an attribute node.

Figure 8.23 Construction fails if attributes don’t appear first!

8.4.3 Constructing Documents with Multiple Levels of Nesting

Assume that you are asked to construct a summary document that contains names and phone numbers, except home phones, for all Canadian customers, exactly as defined in Figure 8.24. Note that the desired output is a single document, not one document per customer. Also, the customer names are requested to be attributes and not repeated for every phone number. This means that you have to construct elements at three levels: the quickinfo element, which includes all customers; then one contact element per customer at the second level; and finally one telephone element for every phone that’s not a home phone.

<quickinfo>
  <contact name="Robert Shoemaker">
    <telephone>905-555-7258</telephone>
    <telephone>905-555-8743</telephone>
  </contact>
  <contact name="Matt Foreman">
    <telephone>905-555-4789</telephone>
  </contact>
</quickinfo>

Figure 8.24 Summary document with work and cell phone numbers

The query in Figure 8.25 constructs the document shown in Figure 8.24. The query begins with the construction of the top-level quickinfo element.
The for clause that iterates over the customers is within the construction of the quickinfo element to achieve the desired document structure. This embedded FLWOR expression is enclosed in curly brackets because it needs to be evaluated and should not be taken as a literal string value for the quickinfo element. The return clause of the FLWOR expression produces one contact element per customer, with a name attribute. Constructing this new attribute is straightforward. In place of the attribute value the query simply uses the expression {$i/name/text()} to compute the desired attribute value. Finally, the contact element includes another FLWOR expression that iterates over the
phone elements of the current customer and produces one telephone element for every non-home phone.

xquery
<quickinfo>{
  for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
  where $i/addr/@country = "Canada"
  return <contact name="{$i/name/text()}">
           {for $p in $i/phone
            where $p/@type != "home"
            return <telephone>{$p/text()}</telephone>
           }
         </contact>
}
</quickinfo>;

Figure 8.25 Construction of the document in Figure 8.24

Note how the structure of the XQuery expression in Figure 8.25 corresponds to the structure of the generated document in Figure 8.24.

8.4.4 Constructing Documents with XML Aggregation in SQL/XML Queries

It is also possible to generate the document in Figure 8.24 with an SQL/XML statement, such as the one in Figure 8.26. This query uses a subquery and XML aggregation to produce the desired document structure. The XMLQUERY function in the subselect constructs the contact elements much like the return clause in Figure 8.25. The extra challenge is to combine the contact elements for all customers into a single document. Without the XMLAGG function, the subquery would produce each contact element in a separate row.

The purpose of any aggregation function in SQL is to combine values from multiple rows into a single value. This is exactly what XMLAGG does for values of type XML. It aggregates the generated contact elements into a single XML sequence. This sequence is a value in a single row of a single column, called contactinfo. This column is of type XML and referenced in the outer SELECT clause to produce the content of the generated quickinfo element. More information on XMLAGG is provided in Chapter 10, Producing XML from Relational Data.

SELECT XMLQUERY('<quickinfo>{$CONTACTINFO}</quickinfo>')
FROM (
  SELECT XMLAGG(
           XMLQUERY('<contact name="{$INFO/customerinfo/name}">
                       {for $p in $INFO/customerinfo/phone
                        where $p/@type != "home"
                        return <telephone>{$p/text()}</telephone>
                       }
                     </contact>')
         ) AS contactinfo
  FROM customer
  WHERE XMLEXISTS('$INFO/customerinfo[addr/@country = "Canada"]'))

Figure 8.26 SQL/XML query to construct the document in Figure 8.24

You may find that constructing complex XML documents with nested and repeating elements is often more intuitive in XQuery than SQL/XML. However, SQL/XML makes it easier to include values from relational columns in the constructed XML data. As a simple example, let’s extend the query in Figure 8.26 such that the value of the relational id column of the customer table is shown as an attribute custid of the contact element. In SQL/XML, you can simply add the custid attribute and reference the relational column with $ID, as shown in Figure 8.27. This would be significantly more complex in the XQuery version of this query (refer to Figure 8.25), involving the use of the function db2-fn:sqlquery to embed an SQL/XML statement inside the XQuery.

SELECT XMLQUERY('<quickinfo>{$CONTACTINFO}</quickinfo>')
FROM (
  SELECT XMLAGG(
           XMLQUERY('<contact name="{$INFO/customerinfo/name}"
                              custid = "{$ID}">
                       {for $p in $INFO/customerinfo/phone
                        where $p/@type != "home"
                        return <telephone>{$p/text()}</telephone>
                       }
                     </contact>')
         ) AS contactinfo
  FROM customer
  WHERE XMLEXISTS('$INFO/customerinfo[addr/@country = "Canada"]'))

Figure 8.27 Constructing an attribute with a value from a relational column

In this section we have described direct constructors for XML elements and attributes. Similarly you can construct XML comments and processing instructions, if needed. When you have to construct XML data from a mix of XML and relational source values, the use of SQL/XML is recommended. Chapter 10 describes additional cases and capabilities for constructing XML data.

8.5 DATA TYPES, CAST EXPRESSIONS, AND TYPE ERRORS

For the discussion of XQuery types, cast operations, and arithmetic expressions we use the purchaseorder table from the DB2 sample database. It has an XML column porder; Figure 8.28 shows one of the documents it contains. We assume the purchase order documents are inserted into the table without schema validation.

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>

Figure 8.28 Sample document in the purchaseorder table

The data types used in the XQuery language consist of two sets of types:

• The built-in data types that are defined in the XML Schema specification. These XML Schema types are in the namespace http://www.w3.org/2001/XMLSchema, which has the predeclared namespace prefix xs.
• The predefined types of XQuery, which are in the namespace http://www.w3.org/2005/xpath-datatypes with the predeclared prefix xdt.

Some of the most commonly used types include

• xs:integer
• xs:decimal
• xs:double
• xs:string
• xs:base64Binary
• xs:hexBinary
• xs:boolean
• xs:date
• xs:time
• xs:dateTime
• xs:duration
• xdt:dayTimeDuration
• xdt:yearMonthDuration

The complete list of XQuery data types is documented in the DB2 information center.

When you query the purchase order documents you probably want to treat the PoNum attribute as a numeric value, the OrderDate as a date value, the Status as a character string, the price as a decimal or double precision value, and so on. Luckily, most of that happens automatically. For example, Figure 8.29 shows a query with two predicates, @PoNum = 5000 and @Status = "Unshipped". Since the literal value 5000 is not in quotes, it is interpreted as a numeric value.
Thus, DB2 automatically casts the value of the attribute PoNum to xs:double to perform a numeric comparison against the value 5000. No explicit casting is necessary. Similarly, the literal value "Unshipped" is recognized as type xs:string, which causes the values of the attribute Status also to be cast to xs:string for a textual comparison of the two.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder[@PoNum = 5000 and @Status = "Unshipped"]
return $i;

Figure 8.29 XQuery with a numeric predicate and a string predicate

However, occasionally it is necessary to convert a value to a specific data type. The query in Figure 8.30 tries to retrieve all purchase orders where the OrderDate is "2006-02-18". The literal value in quotes is interpreted as a character string, which leads to a textual comparison using the type xs:string. But textual comparisons do not follow the same semantics as date comparisons, so the query in Figure 8.30 can potentially return a logically incorrect result. For example, if the OrderDate value in the document in Figure 8.28 had a time zone indicator, then this query with string comparison would not return that document.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/@OrderDate = "2006-02-18"
return $i;

Figure 8.30 XQuery with a string comparison, not a date comparison!

Similarly, the query in Figure 8.31 returns the document in Figure 8.28 only because the literal value in the query is cast to xs:date. The Z at the end of the date value is the time zone indicator for UTC time. UTC stands for Coordinated Universal Time, which is the same as Greenwich Mean Time. The string representation of this date value is different from the one in the document in Figure 8.28, but when cast to xs:date they represent the same logical date.
Also, if an XML index of type DATE is defined on /PurchaseOrder/@OrderDate, the query in Figure 8.31 can use the index because the type of the predicate matches the type of the index. The query in Figure 8.30 cannot use the index.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/@OrderDate = xs:date("2006-02-18Z")
return $i;

Figure 8.31 XQuery with a date comparison

In such cases, the casting has to be applied to the literal value. In Figure 8.32, the casting is wrongly applied to the OrderDate attribute instead, and the query fails with error SQL16003N. The problem is that the left side of the predicate is of type xs:date, while the right side is of type xs:string. This leads to a type error at query runtime.
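The difference between the string comparison in Figure 8.30 and the date comparison in Figure 8.31 can be reproduced outside DB2. The following Python sketch is only a model of the semantics, not DB2 code: the helper parse_xs_date is hypothetical, and the sketch assumes that a date without a time zone indicator is interpreted in UTC, which is what the discussion of Figure 8.31 implies.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: dates without a time zone indicator are interpreted as UTC.
IMPLICIT_TZ = timezone.utc

_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(Z|[+-]\d{2}:\d{2})?$")


def parse_xs_date(text):
    """Hypothetical helper: return the starting instant of an xs:date-like string."""
    match = _DATE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid date: {text!r}")
    year, month, day, tz = match.groups()
    if tz is None:
        tzinfo = IMPLICIT_TZ
    elif tz == "Z":
        tzinfo = timezone.utc
    else:
        sign = 1 if tz[0] == "+" else -1
        offset = timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
        tzinfo = timezone(sign * offset)
    return datetime(int(year), int(month), int(day), tzinfo=tzinfo)


stored = "2006-02-18"     # OrderDate value in the sample document
literal = "2006-02-18Z"   # literal used in the date predicate

# Textual comparison: the two representations differ.
string_equal = stored == literal
# Date comparison: both denote the same logical date.
date_equal = parse_xs_date(stored) == parse_xs_date(literal)

print(string_equal, date_equal)
```

The two values differ as strings but compare as equal once both sides are interpreted as dates, which is why the literal has to be cast.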
xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/xs:date(@OrderDate) = "2006-02-18Z"
return $i;

SQL16003N An expression of data type "xs:string" cannot be used when the data type "xs:date" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507

Figure 8.32 Cannot compare xs:string to xs:date!

Similarly, the XQuery in Figure 8.33 also fails with a type error. The literal value 10 is numeric because it is not in quotes. Hence, DB2 tries to perform a numeric comparison of type xs:double. However, the value “Unshipped” of the Status attribute cannot be cast to any numeric data type, so the comparison fails.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/@Status > 10
return $i;

SQL16061N The value "Unshipped" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "xs:double". Error QName=err:FORG0001. SQLSTATE=10608

Figure 8.33 Cannot compare xs:string to xs:double!

What if you have some documents where the Status attribute contains numeric values and some documents where it contains alphanumeric string values? In that case you might still want to use the query in Figure 8.33 to find all orders whose Status has a numeric value greater than 10. You can use the XQuery expression castable together with the if-then-else expression to apply the numeric predicate only if the Status attribute of a given document is a valid integer number. For all other documents the value false is produced to exclude them from the result set.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/(
    if (@Status castable as xs:integer)
    then (@Status > 10)
    else false() )
return $i;

Figure 8.34 XQuery with the expression castable

The SQL/XML statement in Figure 8.35 intends to read all purchase orders where the first item in the order is less expensive than the second item.
Clearly, the purchase order in Figure 8.28 should be in the result set because the price of its first item is 9.99 while the price of the second item is 49.99. But, contrary to what you might expect, the predicate in Figure 8.35 does not select the purchase order in Figure 8.28. Let’s examine why that is. First of all, note that the predicate [item[1]/price < item[2]/price] does not include any literal value that could provide an indication of the data type of the comparison. Hence, according to the XQuery standard, DB2 simply performs a string comparison, and the string “9.99” is greater than the string “49.99”. In summary, the query in Figure 8.35 runs, but does not work the way you want.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
    item[1]/price < item[2]/price]');

Figure 8.35 String comparison between two elements in a document

The solution is to cast either the left side of the predicate, or the right side, or both to xs:double, as shown in Figure 8.36. If at least one of the two operands is cast to a specific data type, then this determines the data type of the comparison operation and DB2 tries to cast the other operand to the same data type. Consequently, the query in Figure 8.36 performs a numeric comparison of the two price elements and therefore includes the purchase order in Figure 8.28 in the result set, as expected.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
    item[1]/xs:double(price) < item[2]/price]');

Figure 8.36 Numeric comparison between two elements in a document

Note that the casting functions, which are actually called type constructors, can cast at most one item at a time. The following expression would fail because one purchase order contains multiple item elements, and a sequence of two or more items cannot be cast to a double value.

xs:double($i/PurchaseOrder/item/price)

To cast all items in the sequence, use the type constructor at the end of the XPath expression, such as in the following:

$i/PurchaseOrder/item/xs:double(price)
$i/PurchaseOrder/item/price/xs:double(.)

8.6 ARITHMETIC EXPRESSIONS

XQuery provides arithmetic operators for addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulus (mod). A subtraction operator must be preceded by whitespace if it could otherwise be interpreted as part of a variable or tag name. For example, price-discount will be interpreted as a single name, but price -discount and price - discount will be interpreted as arithmetic expressions between two separate items. Arithmetic operators can be used with elements, attributes, or a mix of both.
Figure 8.37 provides two examples, one in SQL/XML and one in XQuery notation. Both multiply the quantity and the price of each item in the purchase order that has PoNum=5000. Note that the for clause of the XQuery iterates over item elements and computes the value of each item separately.

SELECT T.id, T.itemvalue
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder/item'
    COLUMNS
      id        VARCHAR(15)  PATH 'partid',
      itemvalue DECIMAL(9,2) PATH 'quantity * price') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum= 5000]');

ID              ITEMVALUE
--------------- -----------
100-100-01            29.97
100-103-01           249.95

2 record(s) selected.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]/item
let $q := $i/quantity
let $p := $i/price
return <itemValue id="{$i/partid}">{$q * $p}</itemValue>;

<itemValue id="100-100-01">29.97</itemValue>
<itemValue id="100-103-01">249.95</itemValue>

2 record(s) selected.

Figure 8.37 SQL/XML and XQuery with arithmetic expression

The first step in evaluating an arithmetic expression is to evaluate its operands. If one of the operands is an empty sequence, the result of the arithmetic expression is also an empty sequence. If one of the operands is a sequence of more than one item, a type error is raised. This happens in Figure 8.38. This query iterates over purchase orders, not over items. Since a purchase order typically has multiple items, the let clauses bind a sequence of multiple quantity elements to $q and a sequence of multiple price elements to $p. This leads to an error in the multiplication in the return clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]
let $q := $i/item/quantity
let $p := $i/item/price
return <itemValue id="{$i/item/partid}">{$q * $p}</itemValue>;

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected.

Figure 8.38 Operands in an arithmetic expression must be zero or one item
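The operand rules described above can be modeled in a few lines of Python. This is a minimal sketch rather than a description of how DB2 evaluates the expression: sequences are represented as Python lists of untyped string values, the function name xq_multiply is hypothetical, and the values are taken from the sample document in Figure 8.28.

```python
import math


def xq_multiply(left, right):
    """Model '*' applied to two sequences (lists) of untyped string values."""
    # A sequence of more than one item is a type error.
    if len(left) > 1 or len(right) > 1:
        raise TypeError("operand is a sequence of more than one item")
    # An empty operand yields an empty result.
    if not left or not right:
        return []
    # Untyped values are converted to double-precision numbers.
    return [float(left[0]) * float(right[0])]


# Item values from the sample purchase order 5000.
items = [
    {"partid": "100-100-01", "quantity": "3", "price": "9.99"},
    {"partid": "100-103-01", "quantity": "5", "price": "49.99"},
]

# Iterating over items: each multiplication sees exactly one quantity and one price.
per_item = [xq_multiply([i["quantity"]], [i["price"]]) for i in items]
print(per_item)

# Iterating over the order instead: both operands hold two items, which fails.
all_quantities = [i["quantity"] for i in items]
all_prices = [i["price"] for i in items]
try:
    xq_multiply(all_quantities, all_prices)
    failed = False
except TypeError:
    failed = True
print(failed)
```

The first loop corresponds to iterating over item elements as in Figure 8.37, and the failing call corresponds to binding all quantity and price elements of an order at once as in Figure 8.38.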
An error is also raised if one of the operands cannot be cast to xs:double. For example, if a quantity element contains the string value “five” then the arithmetic expression fails at runtime.

XQuery provides a division operator (div) and an integer division operator (idiv). The latter returns a result of type xs:integer. For example, the expression 5 div 2 returns the value 2.5, whereas the expression 5 idiv 2 produces the value 2. The idiv operator does not round to the nearest integer. It discards the fractional part of the quotient, just like a cast to xs:integer does.

For testing purposes you can run XQuery expressions with cast and arithmetic operations in the DB2 Command Line Processor, such as in Figure 8.39.

xquery xs:integer(3.9);

3

1 record(s) selected.

xquery 10 + 100 idiv 9;

21

1 record(s) selected.

Figure 8.39 Testing XQuery expressions in the CLP

8.7 XQUERY FUNCTIONS

The XQuery language provides a large number of built-in functions. These include aggregate functions such as count and sum, string functions such as contains and starts-with, functions to manipulate date and timestamp values, numeric functions, and others. A complete discussion of all functions is beyond the scope of this book. Appendix C, Further References, contains pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

In this section we list only a subset of the available XQuery functions to highlight those that are most frequently used and have been found useful in DB2 pureXML production applications. We provide some examples and encourage you to try more functions and queries hands-on with the DB2 sample database. In general, all functions can be applied to elements as well as to attributes. We categorize the discussion of XQuery functions as follows:
• String functions (section 8.7.1)
• Number and aggregation functions (section 8.7.2)
• Sequence functions (section 8.7.3)
• Node and namespace functions (section 8.7.4)
• Date and time functions (section 8.7.5)
• Boolean functions (section 8.7.6)

All XQuery functions belong to a default namespace that is always implicitly bound to the namespace prefix fn. Since it is a default namespace, the prefix can be omitted. For example, concat and fn:concat refer to the same concatenation function.

8.7.1 String Functions

Some of the most commonly used string functions are listed in Table 8.1.

Table 8.1 Commonly Used String Functions

String Function      Description
concat               The function fn:concat returns a string that is the concatenation of two or more atomic values.
string-join          The function fn:string-join takes as input a sequence of string values and a separator string. It returns a single string in which the input strings are concatenated but separated by the separator.
contains             The function fn:contains returns true if a string contains a given substring.
matches              The function fn:matches returns true if a string matches a given regular expression.
starts-with          The function fn:starts-with returns true if a string begins with a given substring.
ends-with            The function fn:ends-with returns true if a string ends with a given substring.
lower-case           The function fn:lower-case converts a string to lowercase.
upper-case           The function fn:upper-case converts a string to uppercase.
translate            The function fn:translate replaces selected characters in a string with replacement characters.
string               The function fn:string returns the string representation of a value.
string-length        The function fn:string-length returns the length of a string.
substring            The function fn:substring returns a substring of a string, based on a start position and a length. It is similar to the substr function in SQL.
substring-after      The function fn:substring-after returns the tail of the input string after the first occurrence of a given search string.
substring-before     The function fn:substring-before returns the beginning of the input string up to (but excluding) the first occurrence of a given search string.
tokenize             The function fn:tokenize breaks a string into a sequence of substrings.
normalize-space      The function fn:normalize-space strips leading and trailing whitespace characters from a string and replaces each internal sequence of whitespace characters with a single space character.

A simple example of the concat function is shown in Figure 8.40. Here, the concat function has four arguments. The first and third arguments are literal string values, while the second and fourth arguments are expressions based on the variable $i that is bound in the for clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
where $i/../@PoNum=5000
return concat("Order ",$i/../@PoNum," - Item ",$i/partid);

Order 5000 - Item 100-100-01
Order 5000 - Item 100-103-01

2 record(s) selected.

Figure 8.40 Concatenation of string literals and expressions

Figure 8.41 demonstrates three string functions. The query uses the concat function to concatenate the values of the attributes PoNum and Status into a single string. In the second column it uses the string-join function to produce a list of partid values that are separated by semicolons. Note that the arguments of the concat function are single values, while the first argument of the string-join function evaluates to a sequence of multiple elements. The contains function in the WHERE clause restricts the result set to purchase orders that have at least one item whose name contains the word “Super”.
SELECT XMLQUERY('$PORDER/PurchaseOrder/concat(@PoNum,@Status)') AS id,
       XMLQUERY('string-join($PORDER/PurchaseOrder/item/partid,";")') AS items
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder/item[contains(name,"Super")]');

ID                 ITEMS
------------------ ------------------------------------------
5000Unshipped      100-100-01;100-103-01
5001Shipped        100-101-01;100-103-01;100-201-01
5004Shipped        100-100-01;100-103-01

3 record(s) selected.

Figure 8.41 Query with three XQuery string functions

XQuery functions can be nested. The query in Figure 8.42 returns the name of an item from purchase order 5000, if the item name contains a comma and contains the word Basic after the comma. The function substring-after is the first argument of the contains function and produces the part of the name after the comma. Thus, the contains function is applied only to that second part of each item name.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[
    contains(substring-after(name,","), "Basic")]/name')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

<name>Snow Shovel, Basic 22 inch</name>

1 record(s) selected.

Figure 8.42 Query with nested XQuery string functions

You can use the function tokenize to split a string into multiple smaller strings. For example, the query in Figure 8.43 splits the values of the partid elements based on the occurrences of the “-” character. The function returns the substrings as a sequence. Instead of using a single character to split the input string, you can also tokenize a string based on the occurrences of a substring or regular expression.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
    @PoNum=5000]/item/tokenize(partid,"-");

100
100
01
100
103
01

6 record(s) selected.

Figure 8.43 Splitting a string into a sequence of separate items
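To get a feel for what these string functions compute, it can help to mimic them in a general-purpose language. The following Python sketch is only an approximation of the XQuery behavior: substring_after is a hypothetical helper written for this illustration, and the input values are copied from the sample document in Figure 8.28.

```python
import re


def substring_after(text, search):
    """Rough analog of fn:substring-after: tail after the first match, else ''."""
    if search == "":
        return text
    _head, separator, tail = text.partition(search)
    return tail if separator else ""


# Values from the two items of purchase order 5000.
names = ["Snow Shovel, Basic 22 inch", "Snow Shovel, Super Deluxe 26 inch"]
partids = ["100-100-01", "100-103-01"]

# contains(substring-after(name, ","), "Basic")
basic = [n for n in names if "Basic" in substring_after(n, ",")]

# string-join(partid sequence, ";")
joined = ";".join(partids)

# tokenize(partid, "-"): the separator is treated as a regular expression,
# so re.split is the closer analog to fn:tokenize than str.split.
tokens = [token for p in partids for token in re.split("-", p)]

print(basic)
print(joined)
print(tokens)
```

The nested call mirrors Figure 8.42, the join mirrors the ITEMS column of Figure 8.41, and the flattened token list has the same six entries as the result in Figure 8.43.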
Although the query in Figure 8.43 returns the tokenized substrings in separate rows, it can be more useful to return them in separate columns instead, which happens in Figure 8.44. The query in Figure 8.44 uses the XMLTABLE function to generate one row per order item. Each generated row has an INTEGER column called OrderNo and an XML column called partid. The INTEGER column contains the purchase order number (PoNum), and the XML column contains the sequence of substrings produced by the tokenize function. In the SELECT clause, this XML column is not returned as-is, but used as input to each of three XMLQUERY functions. They use positional predicates [1], [2], and [3], respectively, to obtain the first, second, and third token of the sequence separately.

SELECT T.orderno,
       XMLCAST(XMLQUERY('$PARTID[1]') as CHAR(3)) as id1,
       XMLCAST(XMLQUERY('$PARTID[2]') as CHAR(3)) as id2,
       XMLCAST(XMLQUERY('$PARTID[3]') as CHAR(3)) as id3
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder/item'
    COLUMNS
      OrderNo INTEGER PATH '../@PoNum',
      partid  XML     PATH 'tokenize(partid,"-")') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO       ID1 ID2 ID3
------------- --- --- ---
         5000 100 100 01
         5000 100 103 01

2 record(s) selected.

Figure 8.44 Splitting a string into separate columns

We encourage you to try other string functions on your own. For example, use the translate function to change the delimiter in the partid values from 100-103-01 to 100/103/01. Or, use the starts-with function to find all items whose name begins with the word “Snow”.

8.7.2 Number and Aggregation Functions

Let’s turn to numeric XQuery functions, some of which are shown in Table 8.2.

Table 8.2 Commonly Used Number and Aggregation Functions

Function             Description
sum                  The function fn:sum returns the sum of the values in a sequence.
avg                  The function fn:avg returns the average of the values in a sequence.
max                  The function fn:max returns the maximum of the values in a sequence.
min                  The function fn:min returns the minimum of the values in a sequence.
abs                  The function fn:abs returns the absolute value of a numeric value.
round                The function fn:round returns the integer that is closest to the given numeric value.

Figure 8.45 shows two XQuery expressions with number and string functions. The first one returns the sum of item prices for each purchase order where the value of the Status attribute starts with “Ship”. For example, this includes orders where the status is Shipped or Shipping. A separate sum is computed for the items within each such purchase order. The second query computes the average item price across all orders that match the starts-with predicate. A single average value is computed for these orders, because the XPath expression that produces the sequence of price elements is the argument of the avg function.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
    starts-with(@Status,"Ship")]/sum(item/price);

73.97
33.97
59.98
33.97

4 record(s) selected.

xquery
avg( db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
    starts-with(@Status,"Ship")]/item/price );

18.3536363636364

1 record(s) selected.

Figure 8.45 The XQuery aggregation functions sum and avg

The same two queries can be coded in SQL/XML notation, as shown in Figure 8.46. They produce the same results as their counterparts in Figure 8.45. Note that the second SELECT statement in Figure 8.46 uses the SQL function AVG, not the XQuery function avg.
SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
FROM purchaseorder
WHERE XMLEXISTS
  ('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');

SELECT AVG(T.itemprice)
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder/item'
    COLUMNS
      itemprice DECIMAL(9,2) PATH 'price') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');

Figure 8.46 The XQuery function sum and the SQL function AVG

In Figure 8.45 and Figure 8.46 you can replace the functions sum and avg with the function count to obtain the number of elements rather than the sum or average of their values. Try it out.

8.7.3 Sequence Functions

The count function is not a numeric function but a sequence function (see Table 8.3) because it counts the number of items in a sequence.

Table 8.3 Commonly Used Sequence Functions

Sequence Function    Description
count                The function fn:count returns the number of items in a sequence.
data                 The function fn:data returns the input sequence but replaces any nodes in the sequence with their values.
distinct-values      The function fn:distinct-values returns the distinct values in a sequence. It is similar to DISTINCT in SQL.
deep-equal           The function fn:deep-equal compares two documents or sequences and returns true if they meet the requirements for deep equality. Roughly speaking, two documents or sequences are deep equal if every aspect of their structure, values, and data type is equal.
empty                The function fn:empty returns true if the argument is an empty sequence.
exactly-one          The function fn:exactly-one returns its argument if the argument contains exactly one item.
zero-or-one          The function fn:zero-or-one returns its argument if the argument contains one item or is an empty sequence.
one-or-more          The function fn:one-or-more returns its argument if the argument is a sequence of one or more items.
last                 The function fn:last takes no parameters but returns the number of items in the sequence that is currently being processed. It is usually used in a positional predicate to return the last item in a sequence.
position             The function fn:position returns the position of the context item in the sequence that is currently being processed.

Figure 8.47 shows three examples that use sequence functions. The goal is to find all the different values that Status attributes in purchase orders can have. The first XQuery in Figure 8.47 returns the value of the Status attribute from all purchase orders in the purchaseorder table. It uses the function data to obtain the attribute values instead of the attribute nodes. The second XQuery uses the distinct-values function to retrieve unique Status values only. The result shows that the sample data contains two different spellings of the value Unshipped, one with lowercase s and one with uppercase S. To address this, the third XQuery uses the string function upper-case to convert all Status values to uppercase. The SQL/XML statement in Figure 8.48 produces the same result by using DISTINCT and the SQL function UPPER.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
    /PurchaseOrder/data(@Status);

Unshipped
Shipped
Shipped
UnShipped
Shipped
Shipped

6 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
    /PurchaseOrder/@Status);

Unshipped
Shipped
UnShipped

3 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
    /PurchaseOrder/upper-case(@Status));

UNSHIPPED
SHIPPED

2 record(s) selected.

Figure 8.47 Using the XQuery sequence functions data() and distinct-values()
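The effect of normalizing the case before removing duplicates is easy to verify with the six Status values returned by the first query. The Python sketch below is an illustration only: the helper distinct_values is hypothetical, and it keeps the order of first occurrence, whereas the order returned by the XQuery function distinct-values should be treated as implementation-dependent.

```python
def distinct_values(values):
    """Hypothetical analog of fn:distinct-values that keeps first-occurrence order."""
    return list(dict.fromkeys(values))


# The six Status values returned by the first query.
statuses = ["Unshipped", "Shipped", "Shipped", "UnShipped", "Shipped", "Shipped"]

# Without normalization, the two spellings of Unshipped stay separate.
raw_distinct = distinct_values(statuses)

# Converting to uppercase first merges the two spellings.
upper_distinct = distinct_values(s.upper() for s in statuses)

print(raw_distinct)
print(upper_distinct)
```

Applying the uppercase conversion to each value before deduplication is what reduces the three distinct values to two.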
SELECT DISTINCT(UPPER(T.stat))
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder'
    COLUMNS
      stat VARCHAR(15) PATH '@Status') AS T;

Figure 8.48 Using DISTINCT in SQL

The SQL/XML statement in Figure 8.49 returns the first and the last item of purchase order 5000 in two separate columns of type XML. The function last(), with no argument, returns the number of items in the sequence and therefore points to the last item.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[1]'),
       XMLQUERY('$PORDER/PurchaseOrder/item[last()]')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

Figure 8.49 Positional predicates to obtain the first and last items

8.7.4 Namespace and Node Functions

Some commonly used namespace and node functions are listed in Table 8.4. The namespace functions are discussed in Chapter 15, Managing XML Data with Namespaces.

Table 8.4 Commonly Used Namespace and Node Functions

Function                   Description
name                       The function fn:name returns the name of a node, typically an element or attribute name. The returned name includes the namespace prefix of the node, if applicable.
local-name                 The function fn:local-name returns the name of a node, but does not include a namespace prefix.
namespace-uri              The function fn:namespace-uri returns the namespace URI of the given node.
namespace-uri-for-prefix   The function fn:namespace-uri-for-prefix returns the namespace URI that is associated with a namespace prefix for an element.
in-scope-prefixes          The function fn:in-scope-prefixes returns a list of prefixes for all in-scope namespaces of an element.

The functions name and local-name are very powerful because they allow access to element and attribute names. In contrast, all previous queries in this chapter used element and attribute names only to get to their values. As an example, the XMLTABLE function in Figure 8.50 iterates over all the child elements of the item elements of purchase order 5000. For each child element it returns the element’s name and value together with the PoNum of the purchase order. Note that the row-generating expression ends with a wildcard that selects all child elements under item. The expressions 'local-name(.)' and '.' in the column definitions use the dot to refer to whatever the current child element is.

SELECT T.OrderNo, T.node, T.value
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder/item/*'
    COLUMNS
      OrderNo INTEGER     PATH '../../@PoNum',
      node    VARCHAR(10) PATH 'local-name(.)',
      value   VARCHAR(40) PATH '.' ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO    NODE       VALUE
---------- ---------- --------------------------------------
      5000 partid     100-100-01
      5000 name       Snow Shovel, Basic 22 inch
      5000 quantity   3
      5000 price      9.99
      5000 partid     100-103-01
      5000 name       Snow Shovel, Super Deluxe 26 inch
      5000 quantity   5
      5000 price      49.99

8 record(s) selected.

Figure 8.50 Producing a list of element names and values

Similarly, you can use the function local-name to produce a list of all tags that occur in a given document. This is shown in Figure 8.51. The row-generating expression of the XMLTABLE function is //(*, @*). To understand what this means, remember that //* selects all elements at all levels of the document, and //@* selects all attributes at all levels of the document. In the expression //(*, @*) the parentheses and the comma construct a sequence that combines all elements and all attributes at all levels. In short, the row-generating expression produces all elements and attributes of the document. The column seq indicates the order in which the nodes appear in the document, and the column node produces their names. The column type indicates whether the node is an attribute, an element, or a leaf element. The if-then-else expression uses the node test self::attribute(), which evaluates to true if the node is an attribute. The else branch contains another if-then-else expression to check whether the current node has any element children. If yes, it is reported as an element. Otherwise it is considered a leaf element.
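The classification logic just described can also be tried outside the database. The following Python sketch uses the standard xml.etree.ElementTree module on the sample document from Figure 8.28. It is an approximation for illustration: the function list_nodes is hypothetical, and emitting each element immediately followed by its attributes is an assumption about how the nodes are ordered.

```python
import xml.etree.ElementTree as ET

# Sample document from Figure 8.28.
DOC = """<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>"""


def list_nodes(xml_text):
    """Return (name, kind) pairs for every element and attribute in the document."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        # An element with at least one child element is an "Element",
        # otherwise it is treated as a "Leaf-Element".
        kind = "Element" if len(elem) > 0 else "Leaf-Element"
        rows.append((elem.tag, kind))
        # Attributes are listed right after the element that owns them.
        rows.extend((name, "Attribute") for name in elem.attrib)
    return rows


rows = list_nodes(DOC)
for seq, (name, kind) in enumerate(rows, start=1):
    print(seq, name, kind)
```

For the sample document this yields 14 rows: one element with children for PurchaseOrder and for each item, three attributes, and eight leaf elements.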
SELECT T.*
FROM purchaseorder,
  XMLTABLE('$PORDER//(*, @*)'
    COLUMNS
      seq  FOR ORDINALITY,
      node VARCHAR(20) PATH 'local-name(.)',
      type VARCHAR(15) PATH 'if (self::attribute())
                             then "Attribute"
                             else (if (./*) then "Element"
                                   else "Leaf-Element")'
  ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

SEQ         NODE               TYPE
----------- ------------------ ------------------
          1 PurchaseOrder      Element
          2 PoNum              Attribute
          3 OrderDate          Attribute
          4 Status             Attribute
          5 item               Element
          6 partid             Leaf-Element
          7 name               Leaf-Element
          8 quantity           Leaf-Element
          9 price              Leaf-Element
         10 item               Element
         11 partid             Leaf-Element
         12 name               Leaf-Element
         13 quantity           Leaf-Element
         14 price              Leaf-Element

14 record(s) selected.

Figure 8.51 Producing a list of all element and attribute names

8.7.5 Date and Time Functions

Some noteworthy date and time functions are listed in Table 8.5.
Table 8.5 Commonly Used Date and Time Functions

Function                   Description
adjust-date-to-timezone    The function fn:adjust-date-to-timezone adjusts an xs:date value to a specific time zone, or removes the time zone component from the value. Similar functions exist for xs:time and xs:dateTime values.
current-date,              These functions return the current date, time, or date and time in the UTC time zone (UTC = Coordinated Universal Time, which is Greenwich Mean Time).
current-time,
current-dateTime
current-local-date,        These functions return the current date, time, or date and time in the local time zone of the operating system, without time zone indicator. (DB2 for Linux, UNIX, and Windows, version 9.5 FP5 and 9.7 FP1.)
current-local-time,
current-local-dateTime
dateTime                   The function fn:dateTime constructs an xs:dateTime value from an xs:date value and an xs:time value.
day-from-date              The function fn:day-from-date returns the day component of an xs:date value. Similar functions exist to extract the month or year from an xs:date value, or to extract the hours, minutes, seconds, or time zone from xs:time or xs:dateTime values.

An example of an SQL/XML query that manipulates dates is shown in Figure 8.52. The goal of the query is to list the identifier, order date, year, and age of all orders that are older than 90 days. Let’s look at the predicate in the WHERE clause first. The predicate selects all orders whose OrderDate attribute is less than the current date minus 90 days. The string literal P90D denotes a duration of 90 days. The P is the duration indicator, and 90D specifies the length of the duration. Similarly, the string P2DT5H45M could be used to denote a duration of 2 days, 5 hours, and 45 minutes. Any such duration string needs to be cast to the type xdt:dayTimeDuration to be interpreted as a duration and not as xs:string. This casting allows you to subtract the duration from the current date to produce a date in the past (90 days ago).

For each matching order, the XMLTABLE function in Figure 8.52 extracts the OrderDate, the year portion of the date, and the age of the order. The age is calculated by subtracting the current date from the order date. Subtraction of one date from another produces a duration. In this example, the returned durations are negative, because current-date() is always larger than any existing OrderDate. The query result shows, for example, that purchase order 5000 was placed 1069 days prior to January 21, 2009.
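The duration notation and the cutoff-date arithmetic can be explored with a short Python sketch. It is a simplified model for illustration, not the XQuery implementation: the function parse_day_time_duration is hypothetical, it handles only the day, hour, minute, and second components, and the reference date of January 21, 2009 is taken from the example output.

```python
import re
from datetime import date, timedelta

# Simplified pattern for day-time durations such as P90D, P2DT5H45M, or -PT8H.
_DURATION_RE = re.compile(
    r"^(-)?P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)


def parse_day_time_duration(text):
    """Hypothetical helper: convert a day-time duration string to a timedelta."""
    match = _DURATION_RE.match(text)
    # Reject strings with no components at all, such as "P" or "PT".
    if match is None or text.endswith(("P", "T")):
        raise ValueError(f"not a valid day-time duration: {text!r}")
    sign, days, hours, minutes, seconds = match.groups()
    delta = timedelta(
        days=int(days or 0),
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )
    return -delta if sign else delta


ninety_days = parse_day_time_duration("P90D")
mixed = parse_day_time_duration("P2DT5H45M")

# Cutoff date for "older than 90 days", relative to the assumed reference date.
today = date(2009, 1, 21)
cutoff = today - ninety_days

# The order date of purchase order 5000 lies before the cutoff.
is_old = date(2006, 2, 18) < cutoff

print(ninety_days, mixed, cutoff, is_old)
```

With the assumed reference date, the cutoff falls on October 23, 2008, so an order dated February 18, 2006 satisfies the predicate.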
SELECT poid, CURRENT DATE as today, T.odate, T.year, T.age
FROM purchaseorder,
  XMLTABLE('$PORDER/PurchaseOrder'
    COLUMNS
      odate DATE     PATH '@OrderDate',
      year  CHAR(4)  PATH 'year-from-date(@OrderDate)',
      age   CHAR(15) PATH 'xs:date(@OrderDate) - current-date()'
  ) as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate <
    current-date() - xdt:dayTimeDuration("P90D")]');

POID        TODAY      ODATE      YEAR AGE
----------- ---------- ---------- ---- ---------------
       5000 01/21/2009 02/18/2006 2006 -P1069D
       5001 01/21/2009 02/03/2005 2005 -P1449D
       5002 01/21/2009 02/29/2004 2004 -P1789D
       5003 01/21/2009 02/28/2005 2005 -P1424D
       5004 01/21/2009 11/18/2005 2005 -P1161D
       5006 01/21/2009 03/01/2006 2006 -P1058D

6 record(s) selected.

Figure 8.52 Using date types and functions

Note that current-date() produces the current date in UTC time. If you are living in California, where the local time is eight hours behind UTC, then from 4 p.m. onwards current-date() gives you tomorrow’s date. New functions to produce the local date and time are being added (refer to Table 8.5) but you can also use XQuery functions to adjust a date or a time to a given time zone, such as in the following query:

xquery
adjust-date-to-timezone(current-date(), xdt:dayTimeDuration("-PT8H"));

8.7.6 Boolean Functions

Finally, XQuery Boolean functions are listed in Table 8.6. An example of using the function fn:false is Figure 8.34 in section 8.5 of this chapter. The use of the function fn:not() was discussed in the context of XPath in section 6.9. Please refer to these sections for examples.

Table 8.6 Commonly Used Boolean Functions

Boolean Function     Description
not                  The function fn:not returns false if the effective Boolean value of a sequence is true, and true if the effective Boolean value of a sequence is false.
false                The function fn:false returns the value false.
true                 The function fn:true returns the value true.
8.8 EMBEDDING SQL IN XQUERY

In section 6.5, How to Execute XPath in DB2, we explained how the function db2-fn:sqlquery lets you embed SQL in XPath queries. The same works in XQuery FLWOR expressions and it allows you to include relational predicates in your XQuery. You can even pass parameters from the outer XQuery to the embedded SQL statement. Remember that the embedded SQL statement has to return a single column of type XML.

For the following examples, note that the table purchaseorder has several relational columns that contain values extracted from the XML document in the same row.

CREATE TABLE purchaseorder(poid BIGINT, status VARCHAR(10),
                           custid BIGINT, orderdate DATE,
                           porder XML);

An interesting pair of queries is shown in Figure 8.53. The first query is an SQL/XML statement that uses the XMLQUERY function in the SELECT clause to compute the sum of the item prices of any selected order. The WHERE clause restricts the result set to those orders in the table where the relational column status has the value Unshipped, the column orderdate has the value 2006-02-18, and the order information in the XML column contains at least one item with a price greater than 40. For each of these orders, the query computes the sum of all item prices.

The second query is a FLWOR expression that produces the same result from our sample data. Its input is defined by the function db2-fn:sqlquery, which produces the sequence of XML documents that are selected by the embedded SQL statement. This allows you to use relational predicates in an XQuery. The XQuery iterates with the for clause over the PurchaseOrder elements of these input documents. For each such element it evaluates the XML predicate on price and returns the sum of item prices for any matching order.
    SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
    FROM purchaseorder
    WHERE status = 'Unshipped'
      AND orderdate = '2006-02-18'
      AND XMLEXISTS('$PORDER/PurchaseOrder/item[price > 40]');

    xquery
    for $i in db2-fn:sqlquery("SELECT porder
                               FROM purchaseorder
                               WHERE status = 'Unshipped'
                                 AND orderdate = '2006-02-18'"
                             )/PurchaseOrder
    where $i/item[price >= 40]
    return sum($i/item/price);

Figure 8.53 Two queries that produce the same result

There is typically no significant performance difference between the two queries in Figure 8.53. Both can use an XML index on /PurchaseOrder/item/price and relational indexes on status and orderdate at the same time.
Let’s extend the previous example slightly to illustrate parameter passing from XQuery to the enclosed SQL statement. Assume you want to return all orders that have the same shipping status and order date as the purchase order with number 5000. The XQuery in Figure 8.54 does that easily. It uses the for and where clauses to select purchase order 5000 and assign it to the variable $i. The return clause then produces the sequence of all orders where the relational columns status and orderdate have the same value as $i/@Status and $i/@OrderDate respectively.

The functions parameter(1) and parameter(2) can only be used in SQL statements inside the db2-fn:sqlquery function. They refer to the XQuery expressions that are provided as additional arguments to the db2-fn:sqlquery function, according to the order in which they appear. That is, $i/@Status is bound to parameter(1) and $i/@OrderDate to parameter(2). Effectively, this is a self-join on the purchaseorder table.

    xquery
    for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
    where $i/@PoNum = 5000
    return db2-fn:sqlquery("SELECT porder
                            FROM purchaseorder
                            WHERE status = parameter(1)
                              AND orderdate = parameter(2)",
                           $i/@Status, $i/@OrderDate );

Figure 8.54 XQuery that contains an SQL statement with parameters

Figure 8.55 shows how you can code the same self-join in SQL/XML notation without any XQuery concepts beyond XPath. The FROM clause contains two references to the purchaseorder table, p1 and p2. The alias p1 is used in the XMLTABLE function to find purchase order 5000 and to extract Status and OrderDate from it. These generated relational columns are then joined with alias p2 in the WHERE clause to find all orders with the same status and date. The queries in Figure 8.54 and Figure 8.55 look very different from each other, but the DB2 query compiler generates the same execution plan for both.
    SELECT p2.porder
    FROM purchaseorder p1, purchaseorder p2,
         XMLTABLE('$po1/PurchaseOrder[@PoNum = 5000]'
           passing p1.porder as "po1"
           COLUMNS
             status    VARCHAR(10) PATH '@Status',
             orderdate DATE        PATH '@OrderDate'
         ) AS T
    WHERE p2.status = T.status
      AND p2.orderdate = T.orderdate;

Figure 8.55 A different notation for the same self-join as in Figure 8.54
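The positional binding of parameter(1) and parameter(2) in Figure 8.54 can be modeled outside DB2. The following Python sketch illustrates the semantics only and is not DB2 code. The list ORDERS holds the relational columns of three sample orders (5000, 5001, and 5004) with values taken from the figures in this book, and the names sqlquery and same_status_and_date are hypothetical helpers introduced for this illustration.

```python
# Illustrative model only; this is not DB2 code.
# Relational columns of three sample orders (poid, status, orderdate).
ORDERS = [
    {"poid": 5000, "status": "Unshipped", "orderdate": "2006-02-18"},
    {"poid": 5001, "status": "Shipped", "orderdate": "2005-02-03"},
    {"poid": 5004, "status": "Shipped", "orderdate": "2005-11-18"},
]


def sqlquery(predicate, *params):
    """Hypothetical stand-in for db2-fn:sqlquery: the extra arguments are
    bound by position, so params[0] plays the role of parameter(1)."""
    return [row["poid"] for row in ORDERS if predicate(row, params)]


def same_status_and_date(po_num):
    """Model of the self-join: find the driving order, then pass its
    status and order date into the inner query."""
    results = []
    for outer in ORDERS:                      # the for clause
        if outer["poid"] == po_num:           # the where clause
            results += sqlquery(
                lambda row, p: row["status"] == p[0]
                and row["orderdate"] == p[1],
                outer["status"],              # bound to parameter(1)
                outer["orderdate"],           # bound to parameter(2)
            )
    return results


print(same_status_and_date(5000))  # [5000]
```

With only these three rows, order 5000 matches nothing but itself. The point of the sketch is the argument order: swapping the last two arguments of the sqlquery call would silently compare status against a date.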
8.9 USING SQL FUNCTIONS AND USER-DEFINED FUNCTIONS IN XQUERY

There are many built-in SQL functions that are not part of the XQuery language. For example, functions such as sqrt (square root), rand (random number), or cos (cosine) are available as SQL functions in DB2 but they are not available as built-in XQuery functions. Additionally you might have developed your own user-defined functions (UDFs), either in the SQL Procedural Language (SQL PL) or in an external programming language such as Java or C. It is possible to use such functions from the SQL world within XQuery expressions. The trick is to use the db2-fn:sqlquery function to embed SQL functions in XQuery.

Assume that you have a legacy application that processes partid values, which are product identifiers, in a different format. For example, a partid such as 100-103-01 needs to be converted to 01(100)103. This is achieved by the UDF in Figure 8.56. It breaks a given partid into its three pieces and assembles them in a different way to meet the requirements of the legacy system.

    CREATE FUNCTION convert(partid VARCHAR(15))
    RETURNS VARCHAR(15)
    BEGIN ATOMIC
      DECLARE p1, p2, p3, new VARCHAR(10) DEFAULT '';
      SET p1 = substr(partid,1,3);
      SET p2 = substr(partid,5,3);
      SET p3 = substr(partid,9,2);
      SET new = p3||'('||p1||')'||p2;
      RETURN new;
    END#

Figure 8.56 User-defined function to convert product identifiers

The FLWOR expression in Figure 8.57 uses this UDF in its let clause to convert every partid in purchase order 5000 to the different format. The db2-fn:sqlquery function contains an SQL statement, which in this case is simply a VALUES clause. Since the result of the embedded SQL statement must be of type XML, the XMLTEXT function is used to turn the VARCHAR result value of the function convert into an XML text node. The convert function takes a single parameter, which has to be cast to the input type of the function, that is, VARCHAR(15). The expression $i/partid provides the actual value that is passed into the convert function.
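To check the string positions used by the UDF in Figure 8.56, the same slicing can be mirrored in a few lines of Python. This is only a sketch of the substring arithmetic, not something that runs inside DB2. Note that SQL substr is 1-based while Python slices are 0-based.

```python
def convert(partid: str) -> str:
    """Mirror of the SQL UDF in Figure 8.56, using 0-based Python slices."""
    p1 = partid[0:3]    # substr(partid, 1, 3)
    p2 = partid[4:7]    # substr(partid, 5, 3)
    p3 = partid[8:10]   # substr(partid, 9, 2)
    return f"{p3}({p1}){p2}"


print(convert("100-103-01"))  # 01(100)103
print(convert("100-100-01"))  # 01(100)100
```

Both results agree with the values in the new elements shown in Figure 8.57.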
    xquery
    for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
    let $new := db2-fn:sqlquery("
          VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
          $i/partid)
    where $i/../@PoNum = 5000
    return <out><old>{$i/partid/text()}</old><new>{$new}</new></out>;

    <out><old>100-100-01</old><new>01(100)100</new></out>
    <out><old>100-103-01</old><new>01(100)103</new></out>

      2 record(s) selected.

Figure 8.57 Using an SQL UDF within an XQuery

You can use the db2-fn:sqlquery function anywhere where built-in XQuery functions are allowed. Figure 8.58 gives you a couple of ideas. The first FLWOR expression uses the db2-fn:sqlquery function in the construction of the element new. Note that it has to be in curly brackets so that it gets properly evaluated and not treated as a literal string. The second XQuery uses db2-fn:sqlquery in a path expression. The XPath in the return clause is $i/PurchaseOrder/item/partid except that the db2-fn:sqlquery function is applied to the last step, partid.

    xquery
    for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
    where $i/../@PoNum = 5000
    return <out><old>{$i/partid/text()}</old>
                <new>{ db2-fn:sqlquery("
                         VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
                         $i/partid) }</new></out>;

    xquery
    for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
    where $i/PurchaseOrder/@PoNum = 5000
    return $i/PurchaseOrder/item/db2-fn:sqlquery("
             VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
             partid);

Figure 8.58 Further examples of using the db2-fn:sqlquery function

8.10 SUMMARY

XQuery is a powerful query language for XML data. XPath is a subset of the XQuery language and used in every XQuery expression that accesses XML documents. Hence, XPath is a critical part of XQuery.
One of the most commonly used expressions in XQuery is the FLWOR expression, which is named after its keywords for, let, where, order by, and return. The for clause of a FLWOR expression lets you iterate over documents, elements, attributes, atomic values, or any sequence of items in the XQuery data model. In each iteration, a variable is assigned to the next item in the sequence for further manipulation. The let clause allows you to assign an entire sequence, such as an intermediate result, to a single variable. The where and order by clauses are used to filter and sort the result of the FLWOR expression. The result is then returned by the return clause, possibly with further manipulation. FLWOR expressions can express queries over sets of documents, perform joins across documents, and combine data from multiple XML documents or different parts of a single document into a query result.

Other important expressions in XQuery include constructor expressions, such as direct element and attribute constructors, which are used to create XML nodes and construct new XML documents within a query. Conditional expressions (if-then-else) allow for advanced logic. Additionally, XQuery supports cast expressions, arithmetic expressions, logical and comparison operators, and sequence and transform expressions. XQuery also offers a rich set of built-in functions, such as string functions, numeric functions, aggregation functions, and date and time functions.

Not every XML application requires XQuery. Many applications are well-served with the combined power of XPath and SQL. In fact, many queries in XQuery notation can also be expressed in SQL/XML with embedded XPath.
CHAPTER 9
Querying XML Data: Advanced Queries & Troubleshooting

In this chapter we discuss advanced XML query topics, common errors, and guidelines for avoiding performance pitfalls. The examples include both XQuery and SQL/XML queries. This chapter is organized along the following topics:

• Aggregation and grouping in XML queries (section 9.1)
• Joins between XML columns as well as joins between XML and relational data (section 9.2)
• XML queries with case-insensitive string predicates (section 9.3)
• Guidelines for avoiding common performance problems (section 9.4)
• Common errors in XML queries and how to resolve them (section 9.5)

9.1 AGGREGATION AND GROUPING OF XML DATA

The recommended and most efficient way to perform grouping and aggregation of XML data is to use the XMLTABLE function to extract XML values to relational columns, and then to apply the SQL GROUP BY clause and SQL aggregation functions to these columns. The XQuery 1.0 language by itself, specifically the FLWOR expression, does not have a GROUP BY clause. This shortcoming makes grouping more difficult in XQuery than in SQL, although not entirely impossible.

In the following we discuss grouping and aggregation queries that use the purchase order sample data as input. A sample document is shown in Figure 9.1.
    <PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
      <item>
        <partid>100-100-01</partid>
        <name>Snow Shovel, Basic 22 inch</name>
        <quantity>3</quantity>
        <price>9.99</price>
      </item>
      <item>
        <partid>100-103-01</partid>
        <name>Snow Shovel, Super Deluxe 26 inch</name>
        <quantity>5</quantity>
        <price>49.99</price>
      </item>
    </PurchaseOrder>

Figure 9.1 Sample document in the purchaseorder table

9.1.1 Aggregation and Grouping Queries with XMLTABLE

As an example, let’s determine the number of purchase orders per year since 2004. This is done in Figure 9.2. The XMLTABLE function together with the year-from-date function produces a relational column year of type CHAR(4). This year column is then used in both the SELECT clause and in the GROUP BY clause, as you normally would with relational columns. The relational COUNT() function produces the desired aggregation. The XMLEXISTS predicate in the WHERE clause ensures that the query only looks at orders that were placed in 2004 or later.

    SELECT year, COUNT(*) AS num_orders
    FROM purchaseorder,
         XMLTABLE('$PORDER/PurchaseOrder'
           COLUMNS
             year CHAR(4) PATH 'year-from-date(@OrderDate)') AS T
    WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate >=
                       xs:date("2004-01-01")]')
    GROUP BY year;

    YEAR NUM_ORDERS
    ---- -----------
    2004           1
    2005           3
    2006           2

      3 record(s) selected.

Figure 9.2 Using SQL group by and aggregation on extracted XML values
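The three logical steps of Figure 9.2 (filter, extract a grouping key, count per key) can be traced with a small Python model. This is a sketch for illustration only and says nothing about how DB2 executes the query. The six dates are the OrderDate values listed in Figure 8.52; the function name orders_per_year is hypothetical.

```python
from collections import Counter
from datetime import date

# OrderDate values of the six orders listed in Figure 8.52.
ORDER_DATES = [
    date(2006, 2, 18), date(2005, 2, 3), date(2004, 2, 29),
    date(2005, 2, 28), date(2005, 11, 18), date(2006, 3, 1),
]


def orders_per_year(dates, since=date(2004, 1, 1)):
    """Filter first (the XMLEXISTS step), extract the grouping key
    (the XMLTABLE step), then count per key (GROUP BY with COUNT(*))."""
    years = [d.year for d in dates if d >= since]
    return dict(sorted(Counter(years).items()))


print(orders_per_year(ORDER_DATES))  # {2004: 1, 2005: 3, 2006: 2}
```

The counts agree with the result set in Figure 9.2.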
This pattern of writing XML queries has proven very useful. The XMLTABLE function raises selected values from the XML level to the SQL level, and then you can apply SQL functions and groupings to these values as you normally do in purely relational queries.

Let’s apply this pattern to another business question. What is the total value of shipped and unshipped items that were ordered in 2006? The answer is computed by the query in Figure 9.3. To write this query, you might want to start with the WHERE clause to restrict the orders to 2006. The path expression in the XMLEXISTS predicate navigates to the OrderDate attribute and checks whether it is greater than or equal to the first day of 2006, and less than or equal to the last day of 2006. Note that both dots in the predicate refer to the OrderDate attribute, which is the current node in the navigation.

NOTE In the XMLEXISTS predicate, don’t use the year-from-date function to restrict the orders to 2006 because that function would prevent the use of an XML index that might exist on the OrderDate attribute.

While the WHERE clause takes care of the filtering, the XMLTABLE function extracts the data items needed to aggregate the value of shipped and unshipped items. For each item in an order it produces one row with the item price, quantity, and shipping status. This allows you to use SQL concepts to group by the status and to sum the item values. The value of an item in an order is the item price multiplied by its quantity.

    SELECT orderstatus, SUM(itemprice * itemqty) AS value
    FROM purchaseorder,
         XMLTABLE('$PORDER/PurchaseOrder/item'
           COLUMNS
             orderstatus VARCHAR(10)  PATH 'upper-case(../@Status)',
             itemprice   DECIMAL(9,2) PATH 'price',
             itemqty     INTEGER      PATH 'quantity') AS T
    WHERE XMLEXISTS('$PORDER/PurchaseOrder/@OrderDate[
                       . >= xs:date("2006-01-01") and
                       . <= xs:date("2006-12-31")]')
    GROUP BY orderstatus;

    ORDERSTATUS VALUE
    ----------- ---------------------------------
    SHIPPED                                149.87
    UNSHIPPED                              279.92

      2 record(s) selected.

Figure 9.3 Total value of shipped and unshipped items in 2006
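To make the row-per-item idea concrete, the following Python sketch applies the same steps to the sample document from Figure 9.1 (abbreviated here by omitting the name elements). It is an illustration under simplifying assumptions: the date filter is left out because only one document is used, and the helper names item_rows and value_by_status are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from decimal import Decimal

# Sample document from Figure 9.1, without the name elements.
DOC = """<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item><partid>100-100-01</partid><quantity>3</quantity><price>9.99</price></item>
  <item><partid>100-103-01</partid><quantity>5</quantity><price>49.99</price></item>
</PurchaseOrder>"""


def item_rows(xml_text):
    """One row per item, like XMLTABLE('$PORDER/PurchaseOrder/item' ...)."""
    order = ET.fromstring(xml_text)
    status = order.get("Status").upper()          # upper-case(../@Status)
    for item in order.findall("item"):
        yield (status,
               Decimal(item.findtext("price")),   # itemprice
               int(item.findtext("quantity")))    # itemqty


def value_by_status(docs):
    totals = defaultdict(Decimal)                 # GROUP BY orderstatus
    for doc in docs:
        for status, price, qty in item_rows(doc):
            totals[status] += price * qty         # SUM(itemprice * itemqty)
    return dict(totals)


print(value_by_status([DOC]))  # {'UNSHIPPED': Decimal('279.92')}
```

The two item values are 3 x 9.99 = 29.97 and 5 x 49.99 = 249.95, which add up to the 279.92 reported for UNSHIPPED in Figure 9.3. Decimal is used instead of float so that the cents are exact, in the spirit of the DECIMAL(9,2) column type.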
If you need to obtain the total value of shipped and unshipped goods for all years and not just 2006, remove the WHERE clause, extract the year from the OrderDate attribute, and add the year column to the SELECT and GROUP BY clauses (see Figure 9.4).

    SELECT year, orderstatus, SUM(itemprice * itemqty) AS value
    FROM purchaseorder,
         XMLTABLE('$PORDER/PurchaseOrder/item'
           COLUMNS
             year        CHAR(4)      PATH 'year-from-date(../@OrderDate)',
             orderstatus VARCHAR(10)  PATH 'upper-case(../@Status)',
             itemprice   DECIMAL(9,2) PATH 'price',
             itemqty     INTEGER      PATH 'quantity') AS T
    GROUP BY year, orderstatus;

    YEAR ORDERSTATUS VALUE
    ---- ----------- ---------------------------------
    2004 SHIPPED                                149.87
    2005 SHIPPED                                263.90
    2005 UNSHIPPED                                9.99
    2006 SHIPPED                                149.87
    2006 UNSHIPPED                              279.92

      5 record(s) selected.

Figure 9.4 Grouping by multiple XML attributes

9.1.2 Aggregation of Values within and across XML Documents

The previous queries in Figure 9.3 and Figure 9.4 sum up the item values for shipped and unshipped orders. They do not look at the value of individual orders, which would require the item values within each order to be aggregated first. Let’s look at aggregated item values per order with another business question. What is the minimum, maximum, and average order value in 2005 and 2006?

The query in Figure 9.5 answers this question. The value of an order is the sum of all of its item values, and an item value is the item price multiplied by its quantity. The details of the XMLTABLE function are critical for this computation. First of all, the row-generating expression in the XMLTABLE function iterates over PurchaseOrder elements, not over item elements. For each order, the XMLTABLE function produces two columns, containing the year and the value of the order, respectively. The expression sum(item/(price * quantity)) in the definition of the value column is noteworthy.
Remember that the row-generating expression in the XMLTABLE function, in this case $PORDER/PurchaseOrder, provides the context for the column-generating expressions. Since one purchase order usually has multiple items, the expression item/(price * quantity) multiplies the price and the quantity of each item, and returns a sequence of as many values as there are items in a given order. The surrounding sum function then aggregates these item values to a single order value. This ensures that the entire column-generating expression always returns a single value. The SELECT clause then uses these values to compute the min, max, and average order value across all orders in 2005 and 2006.

    SELECT year, MIN(value) AS min,
           CAST(AVG(value) AS decimal(6,2)) AS avg,
           MAX(value) AS max
    FROM purchaseorder,
         XMLTABLE('$PORDER/PurchaseOrder'
           COLUMNS
             year  CHAR(4)      PATH 'year-from-date(@OrderDate)',
             value DECIMAL(5,2) PATH 'sum(item/(price * quantity))'
         ) AS T
    WHERE XMLEXISTS('$PORDER/PurchaseOrder/@OrderDate[
                       . >= xs:date("2005-01-01") and
                       . <= xs:date("2006-12-31")]')
    GROUP BY year;

    YEAR MIN     AVG      MAX
    ---- ------- -------- -------
    2005    9.99    91.29  139.94
    2006  149.87   214.89  279.92

      2 record(s) selected.

Figure 9.5 Aggregating values within and across XML documents

Note that the expression sum(item/(price * quantity)) in the COLUMNS clause of the XMLTABLE function performs aggregation of values within a given document. The MIN, MAX, and AVG functions in the SELECT clause perform aggregation of values across many documents.

9.1.3 Grouping Queries in SQL/XML versus XQuery

So far, all the grouping queries in this section have used SQL/XML notation to exploit the SQL GROUP BY clause as well as SQL aggregation functions. The reason is that the XQuery language, specifically the FLWOR expression, does not (yet) have a GROUP BY clause. This makes grouping less intuitive in XQuery.

As an example, look at Figure 9.6 and Figure 9.7, which show two different ways to determine how often each different item has been ordered. Figure 9.6 uses the same pattern as before. For each item it extracts the partid and quantity values from the XML data and exposes them as relational columns. Then the familiar SQL GROUP BY clause and COUNT and SUM functions are applied.
This shows, for example, that item 100-100-01 appears in five orders with a total order quantity of 14 pieces.
    SELECT partid, COUNT(*) AS num_orders, SUM(qty) AS total_qty
    FROM purchaseorder,
         XMLTABLE('$PORDER/PurchaseOrder/item'
           COLUMNS
             partid VARCHAR(20) PATH 'partid',
             qty    INTEGER     PATH 'quantity' ) as T
    GROUP BY partid;

    PARTID               NUM_ORDERS  TOTAL_QTY
    -------------------- ----------- -----------
    100-100-01                     5          14
    100-101-01                     3          11
    100-103-01                     3           9
    100-201-01                     3          11

      4 record(s) selected.

Figure 9.6 Preferred query pattern to group and aggregate XML values (SQL/XML)

Figure 9.7 shows an XQuery expression that performs the same grouping by means of a self-join. The outermost for clause iterates over all distinct values of the partid element. Each distinct partid value leads to one group in the query result. The first nested for clause (for $i) produces a sequence of all partid values, including duplicates. The outer for clause (for $p) uses this sequence to obtain the four distinct values, 100-100-01, 100-101-01, and so on. In each iteration, the variable $p is assigned to one of these distinct values. For each of these distinct values, the let clause and the return clause are evaluated. The let clause has a nested for clause that produces the sequence of items whose partid matches the current distinct value $p.

    xquery
    for $p in distinct-values(
                for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
                return $i/PurchaseOrder/item/partid/text() )
    let $items := ( for $j in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
                    return $j/PurchaseOrder/item[partid = $p] )
    return <part num_orders="{count($items)}"
                 total_qty="{sum($items/quantity)}">{$p}</part>;

    <part num_orders="5" total_qty="14">100-100-01</part>
    <part num_orders="3" total_qty="11">100-101-01</part>
    <part num_orders="3" total_qty="9">100-103-01</part>
    <part num_orders="3" total_qty="11">100-201-01</part>

      4 record(s) selected.

Figure 9.7 Aggregation and grouping in XQuery is less intuitive and less efficient
For example, it can be the sequence of all items whose partid is 100-100-01. This sequence is assigned to the variable $items. The return clause then constructs a result that shows the partid ($p) as well as the count and total quantity of the items that have this partid.

Although Figure 9.7 is an instructive example of a complex XQuery expression, it is not the recommended way of writing grouping queries. The SQL/XML statement in Figure 9.6 performs much better and is more intuitive for most users.

9.2 JOIN QUERIES WITH XML DATA

For the discussion of join queries we use the product table and the purchaseorder table in the sample database. Let’s review their column definitions as well as a sample document from their respective XML columns (see Figure 9.8). Each purchase order document contains one or multiple items with partid elements. These partid values reference products in the product table. For each product, the pid is an XML attribute in the product document and is also stored as a relational column. This redundant storage can be very useful. For example, it enables you to define a primary key index on the relational pid column.
    CREATE TABLE purchaseorder(
      poid      BIGINT,
      status    VARCHAR(10),
      custid    BIGINT,
      orderdate DATE,
      porder    XML);

    <PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
      <item>
        <partid>100-100-01</partid>
        <name>Snow Shovel,Basic…</name>
        <quantity>3</quantity>
        <price>9.99</price>
      </item>
      <item>
        <partid>100-103-01</partid>
        <name>Snow Shovel,Super…</name>
        <quantity>5</quantity>
        <price>49.99</price>
      </item>
    </PurchaseOrder>

    CREATE TABLE product(
      pid         VARCHAR(10),
      name        VARCHAR(128),
      price       DECIMAL(9,2),
      promoprice  DECIMAL(9,2),
      promostart  DATE,
      promoend    DATE,
      description XML);

    <product pid="100-100-01">
      <description>
        <name>Snow Shovel,Basic…</name>
        <details>Basic Snow Shovel, 22 inches wide, straight
                 handle with D-Grip</details>
        <price>9.99</price>
        <weight>1 kg</weight>
      </description>
    </product>

    [Arrows labeled (1) and (2) in the original figure mark the two join relationships described below.]

Figure 9.8 Two tables for join queries

This sample data allows for two interesting types of joins:

• First, a join between the XML columns porder and description, indicated by the two arrows labeled as (1) in Figure 9.8. This is an XML-to-XML join.
• Second, a join between the XML column porder and the relational column pid, indicated by arrows (2) in Figure 9.8. This is an XML-to-relational join.
9.2.1 XQuery Joins between XML Columns

Let’s start with a simple join in XQuery notation. After that we show the same join in SQL/XML notation. Assume you want to identify all products that have a weight of 3 kilograms and that are part of any order in the purchaseorder table. The first condition (3 kg) requires a predicate on the XML column description in the product table. The second condition (part of any order) requires a join with the purchaseorder table. This join query is shown in Figure 9.9.

The typical pattern of a join in XQuery is a pair of nested for clauses, one for each of the two tables. The variable $po iterates over purchase orders, and the variable $pr iterates over products. The predicate $pr/description/weight = "3 kg" restricts the products to those that weigh 3 kilograms. The predicate $pr/@pid = $po/item/partid is the join predicate. It requires the pid attribute of the product element to be equal to the partid of an item element in a purchase order. The return clause returns the value of the pid attribute of any matching product.

The query produces three rows, each having the value 100-103-01. This is because the product 100-103-01 weighs 3 kilograms and appears in three different purchase orders, as you already saw in Figure 9.6 and Figure 9.7. In other words, this product has three join matches in the purchaseorder table, and this leads to three result rows. This multiplication is not specific to XML and you have probably observed it in relational join queries many times.

    xquery
    for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
    for $pr in db2-fn:xmlcolumn("PRODUCT.DESCRIPTION")/product
    where $pr/description/weight = "3 kg"
      and $pr/@pid = $po/item/partid
    return $pr/data(@pid);

    100-103-01
    100-103-01
    100-103-01

      3 record(s) selected.

Figure 9.9 Nested for clauses in XQuery to express a join

The two nested for clauses produce the Cartesian product between the input sequences, which consist of purchase orders and products, respectively. An analogy in SQL is a SELECT statement with two table names in the FROM clause. The join predicate in the where clause ensures that the entire Cartesian product is not materialized. The order of the two for clauses does not matter and does not determine the join order. The join order for the execution of the query is a cost-based decision of the DB2 optimizer. To enable the use of an XML index to evaluate the join predicate, cast the join keys to the appropriate data type:

    and $po/item/partid/fn:string(.) = $pr/@pid/fn:string(.)

Chapter 13, Defining and Using XML Indexes, describes how to exploit XML indexes for join queries and why the casting is necessary.

An interesting and important difference between XML and relational joins is the following. Note that the query in Figure 9.9 returns the join key in the result set. A join key always exists in both tables, and in relational joins you can select the join key from either of the two tables, with no difference in the result set. Figure 9.10 shows what can potentially happen if you try the same in an XML join. The query in Figure 9.10 is the same as the one in Figure 9.9, except that the expression in the return clause is $po/item/partid to return the product identifier from the purchaseorder table instead of the product table. The result contains products that do not weigh 3 kilograms, such as 100-100-01. This result is semantically correct, but probably not what you wanted.

The reason for this behavior is at the heart of the fundamental difference between XML and relational data. Relational rows are flat, but XML documents can be nested and can have repeating elements. And in the example at hand, the join key partid is a repeating element. Note that the variable $po iterates over purchase orders, and each purchase order can have multiple item elements. Hence, the expression $po/item/partid represents a sequence of multiple partid elements. The join predicate is satisfied if the identifier (@pid) of a given product matches at least one of the partid values in an order. But the return clause then returns all the partids. In other words, the query is written to return all partid values in an order, irrespective of their value, if at least one of them matches a 3 kilogram product in the product table. Once more, this is existential semantics at work (see section 6.8, Existential Semantics).
    xquery
    for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
    for $pr in db2-fn:xmlcolumn("PRODUCT.DESCRIPTION")/product
    where $pr/description/weight = "3 kg"
      and $po/item/partid/fn:string(.) = $pr/@pid/fn:string(.)
    return $po/item/partid/text();

    100-100-01
    100-103-01
    100-101-01
    100-103-01
    100-201-01
    100-100-01
    100-103-01

      7 record(s) selected.

Figure 9.10 Returning a repeating element from a join can be misleading
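The effect of existential semantics on the join result can be reproduced with a small model. The Python sketch below is only an illustration and does not reflect how DB2 evaluates the query. It uses the two partid values of purchase order 5000 from Figure 9.1 and takes from the text that product 100-103-01 weighs 3 kilograms. The function names join_per_order and join_per_item are hypothetical; the second one anticipates the item-level iteration of Figure 9.11.

```python
import xml.etree.ElementTree as ET

# Purchase order 5000 from Figure 9.1, reduced to its partid elements.
ORDER = ET.fromstring(
    '<PurchaseOrder PoNum="5000">'
    "<item><partid>100-100-01</partid></item>"
    "<item><partid>100-103-01</partid></item>"
    "</PurchaseOrder>"
)
PIDS_3KG = {"100-103-01"}  # products that weigh 3 kg, according to the text


def join_per_order(order, pids):
    """Iterate over orders: the predicate holds if ANY partid matches,
    and the return step then emits EVERY partid of that order."""
    partids = [p.text for p in order.findall("item/partid")]
    result = []
    for pid in pids:
        if any(p == pid for p in partids):   # existential comparison
            result.extend(partids)
    return result


def join_per_item(order, pids):
    """Iterate over items: each partid is tested on its own."""
    return [p.text for p in order.findall("item/partid") if p.text in pids]


print(join_per_order(ORDER, PIDS_3KG))  # ['100-100-01', '100-103-01']
print(join_per_item(ORDER, PIDS_3KG))   # ['100-103-01']
```

The first output corresponds to the first two rows of Figure 9.10: the order qualifies because one of its items matches, and then both of its partid values are returned.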
The query result changes back to the three expected rows if the outer for clause iterates over items instead of purchase orders (see Figure 9.11). The effect of this is that the join predicate now checks the partid of each item element separately. Any item that does not match a 3 kilogram product is eliminated.

    xquery
    for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
    for $pr in db2-fn:xmlcolumn("PRODUCT.DESCRIPTION")/product
    where $pr/description/weight = "3 kg"
      and $po/partid/fn:string(.) = $pr/@pid/fn:string(.)
    return $po/partid/text();

    100-103-01
    100-103-01
    100-103-01

      3 record(s) selected.

Figure 9.11 Iterate over repeating elements in the for clause

Next, let’s see how the same join query works in SQL/XML.

9.2.2 SQL/XML Joins between XML Columns

In this section we use SQL/XML instead of XQuery to identify all products that have a weight of 3 kilograms and that are part of any order in the purchaseorder table. Figure 9.12 and Figure 9.13 show two ways of writing this join in SQL/XML. Both queries return the same three product identifiers as in Figure 9.11.

The difference between the two queries in Figure 9.12 and Figure 9.13 is how the join condition is written. Look at their second XMLEXISTS predicate and pay particular attention to the square brackets. In Figure 9.12, the join predicate is of the form:

    $DESCRIPTION/product[ join-condition ]

This predicate is expressed on the DESCRIPTION column of the product table. Thus, XMLEXISTS checks whether there is a product element whose pid attribute is equal to a partid in a purchase order document.

    SELECT XMLQUERY('$DESCRIPTION/product/data(@pid)')
    FROM purchaseorder, product
    WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
      AND XMLEXISTS('$DESCRIPTION/product[ @pid/fn:string(.) =
                       $PORDER/PurchaseOrder/item/partid/fn:string(.)]');

Figure 9.12 Join predicate in XMLEXISTS
In Figure 9.13, the join predicate is of the form:

    $PORDER/PurchaseOrder/item[ join-condition ]

This predicate is expressed on the PORDER column of the purchaseorder table. Thus, XMLEXISTS checks whether there is an item element whose partid element is equal to a pid attribute in a product document.

    SELECT XMLQUERY('$DESCRIPTION/product/data(@pid)')
    FROM purchaseorder, product
    WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
      AND XMLEXISTS('$PORDER/PurchaseOrder/item[ partid/fn:string(.) =
                       $DESCRIPTION/product/@pid/fn:string(.) ]');

Figure 9.13 Join predicate in the opposite “direction” from Figure 9.12

Due to the XPath notation of predicates, the join conditions in Figure 9.12 and Figure 9.13 differ in their “direction”; that is, in the order of their operands. In DB2 9 for z/OS and DB2 9.1 and 9.5 for Linux, UNIX, and Windows, this direction determines the join order. As a result, the predicate in Figure 9.13 is typically preferable because its WHERE clause contains one predicate for the product table and one predicate for the purchaseorder table. If proper indexes exist, this query allows DB2 to use index access to both tables and avoid table scans completely. Prior to DB2 9.7, the query in Figure 9.12 cannot avoid a table scan on the purchaseorder table. In DB2 9.7, the DB2 query compiler abstracts from the notation of the XPath predicate and chooses the appropriate join order based on cost estimates.

Since DB2 for z/OS does not yet allow column names as implicit variables in XPath, such as $DESCRIPTION and $PORDER in the XMLEXISTS predicates in Figure 9.13, you need to use the PASSING clause in these predicates, as demonstrated in Figure 9.14.

    SELECT XMLQUERY('$DESCRIPTION/product/data(@pid)')
    FROM purchaseorder, product
    WHERE XMLEXISTS('$d/product/description[weight="3 kg"]'
                    PASSING description as "d")
      AND XMLEXISTS('$p/PurchaseOrder/item[ partid/fn:string(.) =
                       $d/product/@pid/fn:string(.) ]'
                    PASSING description as "d", porder as "p");

Figure 9.14 Join predicate with PASSING clause

Let’s extend the query in Figure 9.13 to make things more interesting. In particular, you might want to return more information about the 3 kg products than just the product identifier. Assume you want to extract the product ID and weight from the product document as well as the number, status, and date of the purchase orders where the product appears as an item. This is implemented in Figure 9.15. Note that the WHERE clause is unchanged from Figure 9.13. To extract the desired data values from the product and purchase order documents, you can use one XMLTABLE function for each of the two tables. The result set confirms that product 100-103-01 is the only one that weighs 3 kg, and it appears in purchase orders 5000, 5001, and 5004. Two of the three orders with this product are already shipped.

    SELECT T1.*, T2.*
    FROM purchaseorder, product,
         XMLTABLE('$DESCRIPTION/product'
           COLUMNS
             prodid VARCHAR(15) PATH '@pid',
             weight VARCHAR(5)  PATH 'description/weight') AS T1,
         XMLTABLE('$PORDER/PurchaseOrder'
           COLUMNS
             ponum  INTEGER     PATH '@PoNum',
             status VARCHAR(15) PATH '@Status',
             odate  DATE        PATH '@OrderDate') AS T2
    WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
      AND XMLEXISTS('$PORDER/PurchaseOrder/item[ partid/fn:string(.) =
                       $DESCRIPTION/product/@pid/fn:string(.) ]');

    PRODID          WEIGHT PONUM       STATUS          ODATE
    --------------- ------ ----------- --------------- ----------
    100-103-01      3 kg          5000 Unshipped       02/18/2006
    100-103-01      3 kg          5001 Shipped         02/03/2005
    100-103-01      3 kg          5004 Shipped         11/18/2005

      3 record(s) selected.

Figure 9.15 SQL/XML join query with XMLTABLE functions

Note that the query in Figure 9.15 has a join predicate between the rows of the product and purchaseorder tables. It does not have a join predicate between the rows produced by the two XMLTABLE functions. This is not needed here, because both XMLTABLE functions produce exactly one row per document; that is, one row for each row of the underlying tables. Hence, the join predicate between the product and purchaseorder tables is sufficient. Now let’s take a look at a slightly trickier case.
Suppose you want to modify the query in Figure 9.15 so that it returns the item quantity and price from a matching purchase order instead of the order status and date. To achieve this you need to modify the second XMLTABLE function to extract the desired item information (see Figure 9.16). Since quantity and price occur per item, with multiple items per purchase order, it seems reasonable to extend the row-generating expression of the XMLTABLE function to iterate over $PORDER/PurchaseOrder/item. However, the result set seems wrong. For example, it suggests that product 100-103-01 appears two
times in purchase order 5000, with quantities 3 and 5, and with two different prices. But from the previous queries and sample data you know that this is not true. Only the second of the first two rows in the result set is correct.

SELECT T1.*, T2.*
FROM purchaseorder, product,
     XMLTABLE('$DESCRIPTION/product'
       COLUMNS
         prodid VARCHAR(15) PATH '@pid',
         weight VARCHAR(5)  PATH 'description/weight') AS T1,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         ponum INTEGER      PATH '../@PoNum',
         qty   INTEGER      PATH 'quantity',
         price DECIMAL(6,2) PATH 'price') AS T2
WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[
        partid/fn:string(.) = $DESCRIPTION/product/@pid/fn:string(.) ]');

PRODID          WEIGHT PONUM       QTY         PRICE
--------------- ------ ----------- ----------- --------
100-103-01      3 kg          5000           3     9.99
100-103-01      3 kg          5000           5    49.99
100-103-01      3 kg          5001           1    19.99
100-103-01      3 kg          5001           2    49.99
100-103-01      3 kg          5001           1     3.99
100-103-01      3 kg          5004           4     9.99
100-103-01      3 kg          5004           2    49.99

  7 record(s) selected.

Figure 9.16 Misleading result due to multiple item elements per order

The reason for the misleading result set in Figure 9.16 is that the second XMLTABLE function produces multiple rows per purchase order. Although the query has a join predicate between the product and purchaseorder tables, it has no join predicate between the rows produced by the two XMLTABLE functions. Hence, the two XMLTABLE functions generate a Cartesian product that produces misleading tuples in the result set. In particular, the product identifier and weight of a product are combined with the quantity and price of all items of a purchase order, and not only with the one item in the purchase order that actually matches. This produces additional and “wrong” rows in the result set. The solution is to use an additional predicate to remove these extraneous rows. This is a one-line change in Figure 9.16.
Augment the row-generating expression for purchase order items with a predicate on partid and pass in the prodid produced by the other XMLTABLE function. Figure 9.17 shows the changed query and the desired result set.
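The effect of the missing predicate can be modeled outside DB2. The following Python sketch is only an illustration of the row logic, not DB2 code. The lists t1_rows and t2_rows are hypothetical stand-ins for the rows that the two XMLTABLE functions produce for purchase order 5000, using the two items of that order from the sample data.

```python
from itertools import product

# Hypothetical stand-in for the first XMLTABLE: one row per product document.
t1_rows = [{"prodid": "100-103-01", "weight": "3 kg"}]

# Hypothetical stand-in for the second XMLTABLE: one row per item of order 5000.
t2_rows = [
    {"ponum": 5000, "partid": "100-100-01", "qty": 3, "price": 9.99},
    {"ponum": 5000, "partid": "100-103-01", "qty": 5, "price": 49.99},
]

# Without a predicate between T1 and T2, every product row pairs with every item row.
cartesian = [(t1, t2) for t1, t2 in product(t1_rows, t2_rows)]

# With a predicate on partid, only the item that matches the product survives.
filtered = [(t1, t2) for t1, t2 in cartesian if t1["prodid"] == t2["partid"]]

print(len(cartesian))                                     # 2
print([(t2["qty"], t2["price"]) for _, t2 in filtered])   # [(5, 49.99)]
```

The unfiltered pairing yields both items of the order for the one product, which corresponds to the first two rows of the misleading result. The filter keeps only the item with quantity 5 and price 49.99.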
SELECT T1.*, T2.*
FROM purchaseorder, product,
     XMLTABLE('$DESCRIPTION/product'
       COLUMNS
         prodid VARCHAR(15) PATH '@pid',
         weight VARCHAR(5)  PATH 'description/weight') AS T1,
     XMLTABLE('$PORDER/PurchaseOrder/item[partid=$p]'
       passing T1.prodid as "p"
       COLUMNS
         ponum INTEGER      PATH '../@PoNum',
         qty   INTEGER      PATH 'quantity',
         price DECIMAL(6,2) PATH 'price') AS T2
WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[
        partid/fn:string(.) = $DESCRIPTION/product/@pid/fn:string(.) ]');

PRODID          WEIGHT PONUM       QTY         PRICE
--------------- ------ ----------- ----------- --------
100-103-01      3 kg          5000           5    49.99
100-103-01      3 kg          5001           2    49.99
100-103-01      3 kg          5004           2    49.99

  3 record(s) selected.

Figure 9.17 Using an extra predicate to filter the item elements per order

Alternatively, you can extend the second XMLTABLE function so that it produces partid values as a column, and add the join condition T1.PRODID = T2.PARTID to the WHERE clause. This is shown in Figure 9.18 and also produces the correct result.

SELECT T1.*, T2.ponum, T2.qty, T2.price
FROM purchaseorder, product,
     XMLTABLE('$DESCRIPTION/product'
       COLUMNS
         prodid VARCHAR(15) PATH '@pid',
         weight VARCHAR(5)  PATH 'description/weight') AS T1,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         ponum  INTEGER      PATH '../@PoNum',
         partid VARCHAR(15)  PATH 'partid',
         qty    INTEGER      PATH 'quantity',
         price  DECIMAL(6,2) PATH 'price') AS T2
WHERE XMLEXISTS('$DESCRIPTION/product/description[weight="3 kg"]')
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[
        partid/fn:string(.) = $DESCRIPTION/product/@pid/fn:string(.) ]')
  AND T1.PRODID = T2.PARTID;

Figure 9.18 Correct result with multiple item elements per order
Another way to write this join and produce the correct result is to use a single XMLTABLE function with an XQuery FLWOR expression. You see this in Figure 9.19, which no longer has any XMLEXISTS predicates because all predicates are included in the FLWOR expression. Since the product and purchaseorder tables appear in the FROM clause of the query, the FLWOR expression references their XML columns through the variables $DESCRIPTION and $PORDER. Since you want to combine data from two different tables into joined result rows, the return clause of the FLWOR expression has to construct XML fragments that combine the desired elements and attributes. In Figure 9.19, this is the constructed element <result>, which contains attributes and elements from each matching pair of product and purchase order documents. Remember that in such a construction all attributes must appear before any child element. This is why @pid and @PoNum are the first two items in <result>. The constructed <result> elements are input to the COLUMNS clause where they are broken up into relational columns.

SELECT T.*
FROM purchaseorder, product,
     XMLTABLE('for $pr in $DESCRIPTION/product
               for $po in $PORDER/PurchaseOrder/item
               where $pr/description/weight = "3 kg"
                 and $pr/@pid/fn:string(.) = $po/partid/fn:string(.)
               return <result>
                        {$pr/@pid}
                        {$po/../@PoNum}
                        {$pr/description/weight}
                        {$po/quantity}
                        {$po/price}
                      </result>'
       COLUMNS
         prodid VARCHAR(15)  PATH '@pid',
         weight VARCHAR(5)   PATH 'weight',
         ponum  INTEGER      PATH '@PoNum',
         qty    INTEGER      PATH 'quantity',
         price  DECIMAL(6,2) PATH 'price') AS T;

Figure 9.19 Join query with FLWOR expression inside XMLTABLE

The advantage of the query in Figure 9.19 over the query in Figure 9.18 is that the absence of XMLEXISTS predicates makes it somewhat simpler. Also, the embedded FLWOR expression can iterate over the repeating item elements and apply the join condition at that level.
In Figure 9.18 the additional predicate T1.PRODID = T2.PARTID is required to achieve the same. A slight disadvantage of the query in Figure 9.19 is that you have to temporarily construct XML fragments in the return clause, only to break them up again in the COLUMNS clause of the XMLTABLE function. This is fine for small result sets, but introduces overhead for large result sets. An advantage of the query in Figure 9.18 is that it runs on all platforms, whereas the FLWOR expression in Figure 9.19 is not yet available in DB2 9 for z/OS.
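The construct-then-decompose pattern described above can be sketched in Python with the standard xml.etree.ElementTree module. This is a model of the logic only, not how DB2 evaluates the query. The two documents are reduced versions of the sample data, and the helper names flwor_join and to_rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced, illustrative versions of a product document and a purchase order document.
product_doc = ET.fromstring(
    '<product pid="100-103-01"><description><weight>3 kg</weight></description></product>'
)
order_doc = ET.fromstring(
    '<PurchaseOrder PoNum="5000">'
    '<item><partid>100-100-01</partid><quantity>3</quantity><price>9.99</price></item>'
    '<item><partid>100-103-01</partid><quantity>5</quantity><price>49.99</price></item>'
    '</PurchaseOrder>'
)

def flwor_join(products, orders):
    """Nested iteration that mimics the two for clauses, the where clause, and the return clause."""
    results = []
    for pr in products:                          # for $pr in .../product
        for po in orders:
            for item in po.findall("item"):      # for $po in .../PurchaseOrder/item
                weight = pr.findtext("description/weight")
                if weight == "3 kg" and pr.get("pid") == item.findtext("partid"):
                    # return <result>: attributes are set first, child elements follow
                    result = ET.Element("result", pid=pr.get("pid"), PoNum=po.get("PoNum"))
                    ET.SubElement(result, "weight").text = weight
                    ET.SubElement(result, "quantity").text = item.findtext("quantity")
                    ET.SubElement(result, "price").text = item.findtext("price")
                    results.append(result)
    return results

def to_rows(results):
    """Break each constructed element into a tuple, similar in spirit to the COLUMNS clause."""
    return [
        (r.get("pid"), r.findtext("weight"), int(r.get("PoNum")),
         int(r.findtext("quantity")), float(r.findtext("price")))
        for r in results
    ]

rows = to_rows(flwor_join([product_doc], [order_doc]))
print(rows)  # [('100-103-01', '3 kg', 5000, 5, 49.99)]
```

The intermediate result elements exist only to be taken apart again in to_rows, which mirrors the overhead mentioned for large result sets.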
9.2.3 Joins between XML and Relational Columns

The product table holds the product identifier not only in each product’s XML document, but also in the relational column pid. This allows us to illustrate XML-to-relational joins. Assume that you want to find all orders placed in 2006 or later that contain items (products) whose promotional price is greater than 15. The corresponding SQL/XML query is shown in two versions in Figure 9.20. The first version uses the passing clause to pass the XML column porder and the relational column pid into the XMLEXISTS predicate. The second version omits the passing clause and references these columns as implicit variables $PORDER and $PID. Either way, the key mechanism of these XML-to-relational joins is that a relational value is referenced in the XMLEXISTS predicate and compared to an XML element value. Both versions of the query produce the same result and have the same execution plan. They can use relational indexes on product.promoprice and purchaseorder.orderdate as well as an XML index on /PurchaseOrder/item/partid. They cannot use an index on the relational column product.pid because this column is referenced in an XML predicate for which relational indexes are not eligible.
-- with "passing" clause, for all platforms:
SELECT po.poid, po.orderdate, pr.pid, pr.price, pr.promoprice
FROM purchaseorder po, product pr
WHERE pr.promoprice > 15
  AND po.orderdate >= '01/01/2006'
  AND XMLEXISTS('$p/PurchaseOrder/item[partid = $prodid]'
        passing po.porder as "p", pr.pid as "prodid");

-- without "passing" clause, for Linux, UNIX, and Windows:
SELECT po.poid, po.orderdate, pr.pid, pr.price, pr.promoprice
FROM purchaseorder po, product pr
WHERE pr.promoprice > 15
  AND po.orderdate >= '01/01/2006'
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[partid = $PID]');

POID      ORDERDATE  PID        PRICE           PROMOPRICE
--------- ---------- ---------- --------------- ---------------
     5000 02/18/2006 100-103-01           49.99           39.99
     5006 03/01/2006 100-101-01           19.99           15.99

  2 record(s) selected.

Figure 9.20 Join predicate between an XML and a relational column

The query in Figure 9.20 performs the XML-to-relational join by bringing the relational column into the XML context; that is, into the XMLEXISTS predicate. In some cases it is possible to take the opposite approach; that is, to bring the XML side of the join to the relational level and express the join with a relational predicate. This is shown in Figure 9.21. In this query, the functions
XMLCAST and XMLQUERY extract the value of the XML element partid and convert it to the relational data type VARCHAR(10). This allows the partid value to participate in a relational equality predicate with the column pr.pid. In our example, this query fails at runtime because a purchase order has multiple items. As a result, the XMLQUERY function produces a sequence of two or more elements. This causes the XMLCAST function to fail because it can only cast one value at a time. However, this type of join predicate works fine if the XML element or attribute that participates in the join condition occurs at most once per document. In that case DB2 can use a relational index on product.pid to evaluate the join. It cannot use an XML index on /PurchaseOrder/item/partid because the join comparison is a relational predicate, not an XML predicate.

SELECT po.poid, po.orderdate, pr.pid, pr.price, pr.promoprice
FROM purchaseorder po, product pr
WHERE pr.promoprice > 15
  AND po.orderdate >= '01/01/2006'
  AND pr.pid = XMLCAST(
        XMLQUERY('$PORDER/PurchaseOrder/item/partid') AS VARCHAR(10));

SQL16003N An expression of data type "( item(), item()+ )" cannot be
used when the data type "VARCHAR_10" is expected in the context.

Figure 9.21 Cannot cast a sequence of multiple items to an SQL data type

To avoid error SQL16003N, use the XMLTABLE function instead of XMLQUERY with XMLCAST (see Figure 9.22). This produces a separate relational value for each partid element in an order, and each of these values is checked in the join predicate pr.pid = T.partid.
SELECT po.poid, po.orderdate, pr.pid, pr.price, pr.promoprice
FROM purchaseorder po, product pr,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         partid VARCHAR(10) PATH 'partid') AS T
WHERE pr.promoprice > 15
  AND po.orderdate >= '01/01/2006'
  AND pr.pid = T.partid;

POID      ORDERDATE  PID        PRICE           PROMOPRICE
--------- ---------- ---------- --------------- ---------------
     5000 02/18/2006 100-103-01           49.99           39.99
     5006 03/01/2006 100-101-01           19.99           15.99

  2 record(s) selected.

Figure 9.22 Using XMLTABLE to facilitate a relational join predicate
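The difference between casting a whole sequence and producing one row per item can be illustrated with a small Python model. It is a sketch of the cardinality rule only, not DB2 code, and the function names cast_single and explode are hypothetical.

```python
def cast_single(sequence):
    """Mimic a cast that accepts at most one item and rejects longer sequences."""
    if len(sequence) > 1:
        raise ValueError("cannot cast a sequence of %d items to a single value" % len(sequence))
    return sequence[0] if sequence else None

def explode(poid, partids):
    """Mimic the row-per-item approach: one (poid, partid) row for each partid."""
    return [(poid, partid) for partid in partids]

# The two partid values of purchase order 5000 in the sample data.
partids_5000 = ["100-100-01", "100-103-01"]

try:
    cast_single(partids_5000)
    cast_failed = False
except ValueError:
    cast_failed = True

rows = explode(5000, partids_5000)
# Each row now carries a single value, so a plain equality test per row is possible.
matches = [row for row in rows if row[1] == "100-103-01"]

print(cast_failed)  # True
print(matches)      # [(5000, '100-103-01')]
```

The cast works as long as a document contributes at most one value. Once an element repeats, only the row-per-item form can take part in an equality comparison.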
A join between an XML column and a relational column is also possible if you use XQuery rather than SQL/XML. Figure 9.23 shows the typical pattern of an XQuery join; that is, a nested pair of for clauses. The first for clause iterates over the purchase order item elements. The second for clause iterates over product elements of the description documents selected by the embedded SQL query. The SQL query contains the join predicate, which is pid = parameter(1). This join predicate is expressed in relational terms, and the XML element $po/partid is passed into the relational context as the parameter value. The return clause constructs result elements to resemble the output of the query in Figure 9.22. One challenge is to get the relational column promoprice into the result. One possible solution is to use the db2-fn:sqlquery function a second time in the return clause. The embedded SQL statement joins back to the matching row of the product table and produces an XML element promo whose value is the relational promoprice column.

xquery
for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
for $pr in db2-fn:sqlquery("SELECT description FROM product
                            WHERE promoprice > 15
                              AND pid = parameter(1)", $po/partid)/product
where $po/../@OrderDate >= xs:date("2006-01-01")
return <result>
         {$po/../@PoNum}
         {$po/../@OrderDate}
         <pid>{$po/partid/text()}</pid>
         {$pr/description/price}
         {db2-fn:sqlquery("SELECT XMLELEMENT(name ""promo"", promoprice)
                           FROM product
                           WHERE pid = parameter(1)", $po/partid)}
       </result>;

<result PoNum="5000" OrderDate="2006-02-18"><pid>100-103-01</pid>
<price>49.99</price> <promo>39.99</promo></result>
<result PoNum="5006" OrderDate="2006-03-01"><pid>100-101-01</pid>
<price>19.99</price> <promo>15.99</promo></result>

  2 record(s) selected.
Figure 9.23 XQuery join between an XML column and a relational column

9.2.4 Outer Joins between XML Columns

Roughly speaking, an outer join between two tables includes all rows from one of the tables in the join result, even if no match is found in the other table. You can formulate a left outer join or a
right outer join to indicate which of the two tables has its rows retained in the result set. The SQL language has specific keywords to write such outer join queries, but XQuery does not. Still, outer joins can be expressed very naturally in XQuery. For example, assume you want to retrieve information from the product table for all products whose price is less than 100, and include order dates if any of those products appear as items in purchase orders. This requires a left outer join between the product and purchaseorder tables to return products even if they haven’t been ordered. Figure 9.24 shows this outer join in XQuery notation. The trick is to include the join to the purchaseorder table in the element construction of the return clause. The return clause constructs the product information regardless of whether matching purchase orders exist or not. If a product has matching orders, the order dates are included in the constructed document. Otherwise just the product information is returned, without order dates. This achieves the outer join behavior. If you swap the two for clauses in Figure 9.24 then this reverses the row-preserving side of the outer join from product to purchaseorder.

xquery
for $pr in db2-fn:xmlcolumn("PRODUCT.DESCRIPTION")/product
where $pr/description/price < 100
return <productinformation>
         {$pr}
         {for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
          where $po/item/partid/fn:string(.) = $pr/@pid/fn:string(.)
          return <orderdate>{$po/data(@OrderDate)}</orderdate> }
       </productinformation>;

Figure 9.24 Left outer join between product and purchaseorder

The query in Figure 9.25 achieves the same result as the one in Figure 9.24. The only difference is that the inner for-where-return expression is now evaluated in a let clause and then referenced as $orderdates in the return clause.

xquery
for $pr in db2-fn:xmlcolumn("PRODUCT.DESCRIPTION")/product
let $orderdates :=
    for $po in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
    where $po/item/partid/fn:string(.) = $pr/@pid/fn:string(.)
    return <orderdate>{$po/data(@OrderDate)}</orderdate>
where $pr/description/price < 100
return <productinformation>
         {$pr}
         {$orderdates}
       </productinformation>;

Figure 9.25 Left outer join between product and purchaseorder, with let clause
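The outer join behavior follows from where the inner iteration sits: it runs inside the construction of each product result, so a product is emitted even when the inner iteration finds nothing. A minimal Python model of that structure is shown below. It is not DB2 code, the product 999-999-99 is a hypothetical product that was never ordered, and left_outer_join is an illustrative name.

```python
products = [
    {"pid": "100-103-01", "price": 49.99},
    {"pid": "999-999-99", "price": 5.00},   # hypothetical product without any orders
]
orders = [
    {"ponum": 5000, "orderdate": "2006-02-18", "partids": ["100-100-01", "100-103-01"]},
]

def left_outer_join(products, orders, max_price=100):
    """Emit one result per qualifying product; order dates are optional content."""
    result = []
    for pr in products:                          # outer for clause
        if pr["price"] < max_price:              # where clause on the product
            # Inner for-where-return: yields an empty list when nothing matches.
            dates = [po["orderdate"] for po in orders if pr["pid"] in po["partids"]]
            result.append({"pid": pr["pid"], "orderdates": dates})
    return result

joined = left_outer_join(products, orders)
print(joined)
# [{'pid': '100-103-01', 'orderdates': ['2006-02-18']}, {'pid': '999-999-99', 'orderdates': []}]
```

If the inner comprehension were instead a second outer loop with the match as a filter, the unordered product would disappear, which is the inner join behavior.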
If you prefer to return the join result in relational format, you can plug the query from Figure 9.24 or Figure 9.25 into an XMLTABLE function of an SQL SELECT statement, similar to what we illustrated in Figure 9.19.

9.3 CASE-INSENSITIVE XML QUERIES

The values of XML elements and attributes are by definition case sensitive. For example, if you search <city> elements for the value “New York”, you will not find “NEW YORK” or “new york” or “New york”. One way to solve this is to use the XQuery function fn:upper-case() to convert both sides of the predicate to uppercase, as in Figure 9.26. In this query, the search string is provided through a parameter marker, which is passed into the XML predicate as the variable $c. Both $c and the value of the city element are converted to uppercase before comparison. This makes the search case-insensitive, but performance may be suboptimal because the use of such functions precludes the use of XML indexes.

SELECT XMLQUERY('$INFO/customerinfo/addr/city/text()')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/addr[fn:upper-case(city) =
                 fn:upper-case($c) ]'
        PASSING CAST(? AS VARCHAR(15)) AS "c");

Figure 9.26 Case-insensitive predicate, which cannot use an XML index

Case-insensitive queries with index usage are possible as follows. DB2 for Linux, UNIX, and Windows supports locale-aware Unicode collations since DB2 9.5 Fixpack 1. This allows you to ignore case and/or accents. To create a database that is case-insensitive for all string comparisons, use the collation UCA500R1 as in Figure 9.27.

CREATE DATABASE testdb
  USING CODESET UTF-8 TERRITORY US
  COLLATE USING UCA500R1_LEN_S2;

Figure 9.27 Create a case-insensitive database

UCA500R1 specifies that the default Unicode Collation Algorithm (UCA) based on the Unicode standard version 5.0.0 is used in this database. The ordering of characters can be customized using optional attributes.
The attributes are separated by an underscore. The collation name UCA500R1_LEN_S2 contains the attributes LEN and S2. LEN is the concatenation of L (language) and EN (ISO 639-1 language code for English). The second attribute S2 specifies the strength level. Strength level 2 specifies that upper- versus lowercase is ignored but that accents are not ignored. For example, cliche is equal to Cliche but not to cliché. Note that the collation does not change or convert your data, but only defines how string comparisons are performed.
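The naming scheme and the effect of strength level 2 can be illustrated with a short Python sketch. It is a rough approximation and not the Unicode Collation Algorithm itself: Python's str.casefold() is used here only as a stand-in for "ignore case, keep accents", and both function names are hypothetical.

```python
def parse_collation_name(name):
    """Split a collation name such as UCA500R1_LEN_S2 into its base name and attributes."""
    base, *attributes = name.split("_")
    return base, attributes

def equal_ignoring_case_only(a, b):
    """Approximate a strength-2 comparison: case differences are ignored, accents are not."""
    return a.casefold() == b.casefold()

base, attrs = parse_collation_name("UCA500R1_LEN_S2")
print(base, attrs)                                   # UCA500R1 ['LEN', 'S2']
print(equal_ignoring_case_only("cliche", "Cliche"))  # True
print(equal_ignoring_case_only("cliche", "cliché"))  # False
```

As in the database, the comparison function leaves the stored strings untouched and only changes how two values are compared.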
If you define your database with a case-insensitive collation, all string comparisons and indexes are automatically case-insensitive and the use of the upper-case function is not needed. Also, the case of the search string no longer matters. Searching for “Beijing” or “BEIJING” returns the same result. This applies to all relational and XML data in the entire database. It is not possible to restrict the case insensitivity to specific tables or columns. The collation can only be defined when the database is created. It cannot be altered later and cannot be specified per query or per application. Hence, the collation is a far-reaching and irreversible design decision for your database.

Note that the case insensitivity only applies to element and attribute values, not to the tag names themselves. XML tags and path expressions are still case sensitive. For example, the two XPath expressions /customerinfo/city (lowercase “c”) and /Customerinfo/City (uppercase “C”) are still different. The latter would not find any elements in our sample data, because the <city> element in our sample data is spelled in lowercase.

9.4 HOW TO AVOID “BAD” QUERIES

One characteristic that SQL queries have in common with SQL/XML and XQuery is that logically the same query can be written in many different ways. But, just because a query can be written in a certain way does not mean that it should be written that way. You should write queries so that they are intuitive, easy to understand, and easy for DB2 to optimize and process. Let’s look at a few examples in this section.

9.4.1 Construction of Excessively Large Documents

The query in Figure 9.28 constructs a top-level element ShippedOrders and the content of this element is the sequence of all order documents whose status is Shipped. Note that this query returns a single large document that contains all qualifying orders. Whether this query is a good idea depends on the number of shipped orders.
Combining a small number of orders in a single document is fine. However, the larger the number of shipped orders the more trouble you can have with this query. First, returning many individual documents is more efficient than combining them into a single large document. Second, consuming applications often have trouble processing a document that’s tens or hundreds of megabytes in size, especially if the application uses a DOM parser. And finally, if the constructed document exceeds 2GB in size then it cannot be transmitted from the DB2 server to the client application. As a remedy, use a query such as in Figure 9.29 that returns each shipped order as a separate document. An application can easily concatenate them if needed. Alternatively you can use DB2’s EXPORT utility to write many documents to a single large file on disk.
xquery
<ShippedOrders>
  {for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
   where $i/PurchaseOrder[@Status="Shipped"]
   return $i }
</ShippedOrders>;

Figure 9.28 Construction of a single large result document

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder[@Status="Shipped"]
return <ShippedOrder>{$i}</ShippedOrder>;

Figure 9.29 Returning many small documents is often more efficient

9.4.2 “Between” Predicates on XML Data

“Between” predicates are very common. For example, you might ask for all orders between March and June, for all products with a weight between 1 and 5 kilograms, or for all customers with a last name between “A” and “M.” The SQL language has the explicit between keyword to formulate such predicates, but this does not exist in XPath or XQuery. Instead you can use a pair of range predicates. For example, the query in Figure 9.30 tries to retrieve all orders in 2006 or later that have items with a price between 20 and 30. One purchase order is returned but it does not seem to match the intention of the query. It contains two items, but neither one has a price between 20 and 30. And yet, this document is a correct result for the query as it is written. Once again, this is due to existential semantics (see section 6.8) and the fact that item is a repeating element. The predicates item/price >= 20 and item/price < 30 check whether an item element exists whose price element has a value greater than or equal to 20, and whether there also exists an item element whose price is less than 30. But these two item elements do not have to be the same. In fact, the selected purchase order fulfills the predicate item/price >= 20 because there is an item whose price is 49.99. It also fulfills the predicate item/price < 30 because there is an item whose price is 9.99.
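The existential semantics described here can be reproduced in a few lines of Python. The sketch below is a model of the comparison logic, not DB2 code. The list prices_5000 holds the two item prices of purchase order 5000 from the sample data, and the function names are hypothetical.

```python
# Item prices of purchase order 5000 in the sample data.
prices_5000 = [9.99, 49.99]

def between_wrong(prices, low, high):
    """Two independent existential tests: each bound may be satisfied by a different price."""
    return any(p >= low for p in prices) and any(p < high for p in prices)

def between_right(prices, low, high):
    """One existential test: a single price must satisfy both bounds."""
    return any(low <= p < high for p in prices)

print(between_wrong(prices_5000, 20, 30))  # True
print(between_right(prices_5000, 20, 30))  # False
```

The first function accepts the order because 49.99 passes the lower bound and 9.99 passes the upper bound. The second function rejects it because no single price lies within the range.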
SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")
                 and item/price >= 20 and item/price < 30 ]');

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
<item><partid>100-100-01</partid><name>Snow Shovel, Basic 22 inch</name>
<quantity>3</quantity><price>9.99</price></item>
<item><partid>100-103-01</partid><name>Snow Shovel, Super Deluxe 26 inch</name>
<quantity>5</quantity><price>49.99</price></item>
</PurchaseOrder>

  1 record(s) selected.

Figure 9.30 Wrong way to write a between predicate

Both SQL/XML statements in Figure 9.31 write the “between” condition correctly and ensure that both range predicates are applied to the same item price. In the expression item/price[. >= 20 and . < 30], both dots refer to the same price element. Hence, this query selects orders that have at least one item with at least one price element whose value is indeed between 20 and 30. (No such order exists in the sample database.) Based on this notation, DB2 knows that both range predicates are always applied to the same XML node. This allows DB2 to evaluate both predicates with a single start-stop scan (start at 20, stop at 30) over an XML index defined on the price element.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")
                 and item/price[. >= 20 and . < 30]]');

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate >
                 xs:date("2006-01-01")]/item/price[. >= 20 and . < 30]');

  0 record(s) selected.

Figure 9.31 Correct way to write a between predicate

If each item element has at most one price element, then the expression item[price >= 20 and price < 30] also selects the correct query result. However, DB2 does not know that each item has at most one price and therefore cannot apply a single start-stop index scan.
Instead, DB2 has to use two separate index scans plus an index ANDing operator to combine the result (see Table 9.1). This is less efficient. Therefore it is always recommended to write “between”
predicates with the “dot” (current context), as shown in Figure 9.31. Further details on XML index usage and execution plans are provided in Chapters 13 and 14.

Table 9.1 Optimal (left) and Suboptimal Execution Plan (right)

price[. >= 20 and . < 30]            [price >= 20 and price < 30]

         RETURN                               RETURN
           |                                    |
         NLJOIN                               NLJOIN
           |                                    |
         /-+-\                                /-+-\
        /     \                              /     \
    FETCH     XSCAN                      FETCH     XSCAN
      |                                    |
  /---+---\                            /---+---\
 /         \                          /         \
RIDSCN    TABLE:                     RIDSCN    TABLE:
  |       purchaseorder                |       purchaseorder
 SORT                                 SORT
  |                                    |
XISCAN                               IXAND
20 <= price < 30                       |
                                     /-+-\
                                    /     \
                               XISCAN     XISCAN
                           price >= 20    price < 30

9.4.3 Large Global Sequences

Figure 9.32 provides another example of how you should not write queries. The idea of this query comes from a real XML application, but is changed here to fit the purchase order data. The query starts with a let clause and assigns the sequence of all purchase order items in the table to the variable $allitems. This is the first of multiple problems in this query. Unless the table is tiny, the sequence in $allitems is typically very large. Using let to combine items from all (or many) documents in the entire table often results in suboptimal performance.

The next step of the query, for $pid…, iterates over the distinct partid values of all the item elements in the sequence $allitems. For each distinct partid it returns a constructed XML element prod_info that contains the partid (produced by $pid) as well as the name and the price of the item. Note how the name and the price are obtained for each distinct partid; that is, for each value of $pid. The variable $pid is used to probe back into the sequence $allitems to find all items with a matching partid. This probe happens in the predicate $allitems[partid = $pid]. The same is done for price. This coding is not straightforward, needlessly complex, and bad for performance. In particular, the big sequence $allitems is a large temporary object and not indexed.
Hence, the predicates
in the return clause ([partid = $pid]) both require a sequential scan over all items in all purchase orders, for each $pid. An analogy in the relational world would be a query that copies all rows from a table to a temporary table, then performs a “select distinct” on that table to obtain a set of keys, and then a table scan on the temp table for each of these keys.

xquery
let $allitems := ( for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
                   return $i/PurchaseOrder/item )
for $pid in distinct-values($allitems/partid)
order by $pid
return
  <prod_info product = "{$pid}">
    <name>{distinct-values($allitems[partid = $pid]/name)}</name>
    <price>{distinct-values($allitems[partid = $pid]/price)}</price>
  </prod_info>;

Figure 9.32 Expensive usage of large sequences

The result of the query in Figure 9.32 is simply the partid, name, and price for all distinct items that occur in the purchase orders. The same result can be computed in a much easier way, as shown in Figure 9.33. This query simply generates one tuple for each item element and uses the SQL keyword DISTINCT to remove duplicates. In the original case, the performance improved by two orders of magnitude. The rewritten query is also easier to understand.

SELECT distinct T.pid, T.name, T.price
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         pid   VARCHAR(10)  PATH 'partid',
         name  VARCHAR(50)  PATH 'name',
         price DECIMAL(6,2) PATH 'price') as T;

Figure 9.33 Rewritten query avoids large intermediate sequences

9.4.4 Multilevel Nesting SQL and XQuery

A general guideline is to introduce only as much complexity in your queries as you really need. For example, it is certainly possible to have an XQuery with an embedded SQL statement that has an embedded XQuery, and so on. But, experience shows that nesting the two languages more than one level deep is usually not needed to express the desired query logic.
Therefore, we recommend using only one level of embedding XQuery into SQL or vice versa. As a result, queries are easier to understand and to maintain, and often also easier to optimize and execute for DB2. Figure 9.34 shows an example of an XQuery with an embedded SQL statement, which in turn has embedded XQuery expressions in the XMLQUERY function and XMLEXISTS predicate. The
embedded SQL statement produces the purchase order elements from all orders that belong to customer 1001 and whose PoNum attribute has the value 5002. For those orders, the XQuery checks whether the Status is Shipped and returns all order items in a newly constructed element POitems. Using XQuery within the SQL statement and around the SQL statement is needlessly complex.

xquery
for $i in db2-fn:sqlquery("
      SELECT XMLQUERY('$PORDER/PurchaseOrder')
      FROM purchaseorder
      WHERE custid = 1001
        AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5002]') ")
where $i[@Status="Shipped"]
return <POitems>{$i/item}</POitems>;

Figure 9.34 Unnecessary double-nesting of XQuery and SQL

To simplify the query in Figure 9.34, you can choose to either have all XML manipulation outside of the SQL query or all XML manipulation embedded within the SQL query. Both options are demonstrated in Figure 9.35. In the first query in Figure 9.35, all XML operations are pulled out of the SQL statement and into the surrounding XQuery. In the second query, all XML operations are pushed from the surrounding XQuery into the SQL statement.

xquery
for $i in db2-fn:sqlquery("SELECT porder FROM purchaseorder
                           WHERE custid = 1001")
where $i/PurchaseOrder[@PoNum = 5002 and @Status="Shipped"]
return <POitems>{$i/PurchaseOrder/item}</POitems>;

SELECT XMLQUERY('<POitems>{$PORDER/PurchaseOrder/item}</POitems>')
FROM purchaseorder
WHERE custid = 1001
  AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum = 5002 and
                 @Status="Shipped"]');

Figure 9.35 Two simpler versions of the query in Figure 9.34

9.5 COMMON ERRORS AND HOW TO AVOID THEM

This section lists some common error messages that you might encounter when you run XML queries. We discuss probable causes and ways to resolve the problems. DB2 has more than 250 XML-related error messages and we cannot discuss all of them here.
Additionally, a specific error message might have multiple different causes and we cannot describe all of them in this section. Therefore we look at a few select queries, their errors, and how to fix them.
Error messages related to XML processing have numbers in the 16000-range of messages and SQL Codes. That is, the SQL Codes related to XML processing errors are -16000, -16001, -16002, and so on. This is the same in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Additionally, in DB2 for Linux, UNIX, and Windows the error messages for these SQL Codes are numbered SQL16000N, SQL16001N, SQL16002N, and so on. Each error message raised by a faulty XML query also contains an error code, such as err:XPDY0002, which is the error code defined by the W3C. These error codes are listed at http://www.w3.org/2005/xqt-errors/, and you can also search for them in the DB2 information center.

9.5.1 SQL16001N

Figure 9.36 and Figure 9.37 show queries that fail at compile time with error SQL16001N, which indicates that an XPath or XQuery expression does not have a context; that is, the path does not have a proper starting point. In Figure 9.36, INFO is not a valid context, because the XML column name is only recognized if coded as a variable that starts with a $ sign ($INFO).

SELECT info
FROM customer
WHERE XMLEXISTS('INFO/customerinfo[name="Matt Foreman"]');

SQL16001N An XQuery expression starting with token "INFO" cannot be
processed because the focus component of the dynamic context has not
been assigned. Error QName=err:XPDY0002. SQLSTATE=10501

Figure 9.36 Use $INFO instead of INFO to avoid this error

In Figure 9.37, the path in the return clause starts with /addr, but no context is provided to indicate from where this expression should navigate to the addr element. The correct coding in this query is $c/addr instead of /addr.

xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return /addr[@country];

SQL16001N An XQuery expression starting with token "/" cannot be
processed because the focus component of the dynamic context has not
been assigned. Error QName=err:XPDY0002.
SQLSTATE=10501 Figure 9.37 The path in the return clause should start with $c 9.5.2 SQL16002N The error SQL16002N happens at compile time whenever the query parser encounters a keyword or symbol that is unexpected or not recognized. This can happen in many different cases. The query in Figure 9.38 fails because the uppercase keyword FOR is not valid. It has to be lowercase.
xquery
FOR $d IN db2-fn:xmlcolumn ("customer.info")/customerinfo
RETURN $d;

SQL16002N An XQuery expression has an unexpected token "d" following "FOR $". Expected tokens may include: "". Error QName=err:XPST0003. SQLSTATE=10505

Figure 9.38 The keywords for, in, and return must be lowercase

In Figure 9.39, the expression $INFO/customerinfo/ must not end with a slash (/). The slash starts another step in the XPath expression and must be followed by an element name, attribute name, wildcard (*), function name, and so on. Hence the empty string "" after the / is not expected.

SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo/'
       COLUMNS
         name VARCHAR(20) PATH 'name',
         city VARCHAR(20) PATH 'addr/city' ) as T;

SQL16002N An XQuery expression has an unexpected token "" following "$INFO/customerinfo". Expected tokens may include: "<StepExpr>".

Figure 9.39 To avoid this error remove the / after customerinfo

Furthermore, a slash cannot be followed by the square bracket that begins a predicate. Therefore the square bracket in Figure 9.40 causes error SQL16002N.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/[addr/city = "Aurora"]');

SQL16002N An XQuery expression has an unexpected token "[" following "tomerinfo/". Expected tokens may include: "".

Figure 9.40 A predicate must not be preceded by a slash (/)

9.5.3 SQL16003N

Error SQL16003N happens during query execution; that is, at runtime and not at compile time. It indicates that DB2 has encountered a value of a certain data type that is not valid in this situation. The query in Figure 9.41 fails because a sequence of multiple phone elements cannot be cast to a single SQL value. In this error message, the notation ( item(), item()+ ) is a regular expression that represents a sequence of one item followed by one or more items. In total that’s two or more items, but only a single item is allowed here.
SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custname VARCHAR(20) PATH 'name',
         phone    VARCHAR(15) PATH 'phone') AS T;

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "VARCHAR_15" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507

Figure 9.41 Cannot cast multiple phone numbers to a single VARCHAR value

Figure 9.42 shows a query that fails because it tries to compare a value of type xs:date with the value "2006-02-18Z" of type xs:string, which is not allowed.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/xs:date(@OrderDate) = "2006-02-18Z"
return $i;

SQL16003N An expression of data type "xs:string" cannot be used when the data type "xs:date" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507

Figure 9.42 The string literal "2006-02-18Z" must be cast to xs:date

9.5.4 SQL16005N

The query in Figure 9.43 references a variable $c that has not been properly introduced. Normally, variables are introduced by assignment in a for or a let clause. Here, the for clause defines the variable $b, which should be used instead of $c in the return clause.

xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $c/name;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "c" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506

Figure 9.43 The variable $c has not been introduced

Figure 9.44 demonstrates a trickier case. The query tries to return a sequence of name and addr elements, but it lacks parentheses. The expression return ($b/name, $b/addr) is correct and avoids the error. The error message claims that the variable $b is not known. Clearly, $b has been defined in the for clause, so the error is seemingly misleading or even wrong.
xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $b/name, $b/addr;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "b" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506

Figure 9.44 Missing parentheses in the return clause

But the error message in Figure 9.44 is correct. The comma in the return clause is the XQuery comma operator, which constructs sequences. It has the lowest precedence of all operators. Hence, the XQuery expression in Figure 9.44 defines a sequence of two expressions, which are

• for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo return $b/name
• $b/addr

In the first expression, $b is properly introduced in the for clause. In the second expression, $b is not defined, which causes the error message. If you change the return clause to return ($b/name, $b/addr), the parentheses ensure that the comma operator only applies to $b/name and $b/addr, and both of these expressions refer to $b defined in the for clause. The use of the parentheses here is similar to parentheses in arithmetic, such as 3 * (2 + 3) to evaluate the + operator before the multiplication operator.

9.5.5 SQL16015N

When you construct elements with a direct element constructor, and you include a sequence of expressions that provide the child nodes, attributes (if any) must come before elements in this sequence.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <info>{$i/name}{$i/@Cid}</info>;

SQL16015N An element constructor contains an attribute node named "Cid" that follows an XQuery node that is not an attribute node. QName=err:XQTY0024. SQLSTATE=10507

Figure 9.45 Within a constructed element, attributes must be first
The error in Figure 9.45 is avoided if you construct the info element as

return <info>{$i/@Cid}{$i/name}</info>;

or as

return <info Cid="{$i/@Cid}">{$i/name}</info>;

9.5.6 SQL16011N

The query in Figure 9.46 iterates over the distinct OrderDate values of the purchase order documents. The where clause tries to convert each of these values to xs:date for a proper date comparison with a literal value. But the expression raises error SQL16011N because $i contains an atomic value and not an element or attribute node. An atomic value cannot be the input to a navigation step, such as the navigation step /xs:date(.). You can only navigate on nodes, not on atomic values.

xquery
for $i in distinct-values(
    db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/@OrderDate )
where $i/xs:date(.) < xs:date("2006-12-31")
return $i/.. ;

SQL16011N The result of an intermediate step expression in an XQuery path expression contains an atomic value. Error QName=err:XPTY0019. SQLSTATE=10507

Figure 9.46 Cannot navigate on an atomic value

If you remove the distinct-values function, then $i gets bound to OrderDate attribute nodes, and the error is avoided. But removing the deduplication can also change the query result. Hence, the better way to avoid the error is to replace $i/xs:date(.) with xs:date($i) so that there is no navigation step on $i.

9.5.7 SQL16061N

The XMLEXISTS predicate in Figure 9.47 checks whether the value of the Status attribute is 1. The literal value 1 is interpreted as a number because it is not enclosed in quotes. To perform a valid numeric comparison with this number, the value of the Status attribute is automatically cast to xs:double. But, if the value is a string such as "Unshipped", this cast fails with error SQL16061N.
SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@Status = 1]');

SQL16061N The value "Unshipped" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "xs:double"

Figure 9.47 Failure to cast an attribute value to a numeric type

9.5.8 SQL16075N

The query in Figure 9.48 tries to return the Status attribute node. When the query result is serialized to XML text, the serialization of the attribute node fails. It cannot exist by itself outside of an element. The solution is to return the attribute value instead of the attribute node by using the data() or string() function, such as $PORDER/PurchaseOrder/data(@Status). Another solution is to wrap the XMLCAST function around the XMLQUERY function.

SELECT XMLQUERY('$PORDER/PurchaseOrder/@Status')
FROM purchaseorder;

SQL16075N The sequence to be serialized contains an item that is an attribute node.

Figure 9.48 Cannot serialize an attribute node by itself

9.6 Summary

The preferred way to write grouping and aggregation queries for XML data is to use the XMLTABLE function. It allows you to extract XML values to relational columns and then to apply the SQL GROUP BY clause and SQL aggregation functions to these columns. This pattern of writing XML queries provides a high degree of flexibility and allows you to reuse familiar SQL features in XML queries. The XMLTABLE function brings selected values from the XML level to the SQL level so that you can apply any SQL expressions or functions to these values as you normally do in relational queries.

Efficient join queries between XML columns can be written in XQuery or SQL/XML. Either way, remember that the join predicate requires casting to a specific data type. Otherwise an XML index cannot be used to evaluate the predicate.
When you use SQL/XML with a join condition in an XMLEXISTS predicate, remember that in DB2 9.1 and 9.5 for Linux, UNIX, and Windows the order of the operands in the join predicate determines the join order between the two tables. Consider the available indexes and check your queries’ execution plans to ensure an appropriate join order, index usage, and adequate performance.
Observing a set of guidelines can help you avoid common pitfalls with XML queries. When you write queries that construct new XML documents, be mindful of an appropriate document granularity and avoid creating excessively large documents.

When you write a pair of range predicates to express a “between” condition, remember to use the current context (a dot) within the square brackets of the predicate, such as item/price[. >= 20 and . < 30]. This notation ensures that you get the semantically correct result and it allows for an efficient execution plan with a single index scan.

A query pattern that can easily lead to poor performance is an XQuery let clause that builds a single large sequence of elements from all (or many) documents in an XML column. A query construct such as let $i := db2-fn:xmlcolumn("T.C")/xpath should generally be avoided. An XQuery that contains such an expression can often be replaced by a much more efficient SQL/XML query with the XMLTABLE function.
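As a hedged illustration of the “between” guideline, the following SQL/XML query is a minimal sketch. It assumes the purchaseorder table of the sample database and an item/price element path inside each PurchaseOrder document; the path and the bounds 20 and 30 are assumptions chosen for illustration.

```sql
-- Sketch only: assumes PORDER documents contain PurchaseOrder/item/price.
-- Both range conditions sit in one predicate on the same price node (the dot),
-- so a single price value has to satisfy both bounds.
SELECT poid
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder/item/price[. >= 20 and . < 30]');
```

If the two conditions were written as item[price >= 20 and price < 30] instead, different price elements of the same item could each satisfy one of the bounds, which is the semantic difference the guideline warns about.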
Chapter 10 Producing XML from Relational Data

Since XML has emerged as the de facto standard for data exchange, an increasing number of organizations, applications, and interfaces expect to receive data in XML format. For example, web services and enterprise service buses (ESBs) frequently use XML messages to facilitate the information exchange between applications or services. XML is the fabric of Service-Oriented Architectures (SOA).

A frequent requirement is that new applications need to consume existing relational data in XML format. Converting entire relational databases to XML format is rarely feasible or recommended. Instead, the preferred approach is to run queries against the relational data and convert the result set to XML. This conversion from relational to XML can be performed in the application layer, but it is often labor-intensive to develop and maintain procedural application code that constructs XML. Letting the database engine convert relational data to XML is easier and more efficient. Easier, because the construction of XML can be defined in declarative SQL statements. More efficient, because DB2 can construct XML as part of the query processing, which avoids repetitive calls to the database to obtain all required values for an XML document.

This chapter explains how to write queries that read relational tables and return the result set in XML format. There are two ways of writing such queries:

• The first approach uses the SQL/XML publishing functions, most of which have been supported since Version 8 of both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. This approach is explained in section 10.1.
• The second approach uses direct XML constructors of the XQuery language. This is supported in DB2 9.1 and higher for Linux, UNIX, and Windows, and described in section 10.2.
For completeness, sections 10.3 and 10.4 cover the special topics of XML declarations and XML document nodes for constructed XML documents.

Many examples in this chapter use the product table of the sample database. Figure 10.1 shows the content of the relational columns of this table; the XML column description is omitted and the column name is truncated for space. We demonstrate a variety of queries that construct XML from this relational data.

SELECT pid, price, promoprice, promostart, promoend,
       SUBSTR(name,1,15) AS name
FROM product;

PID        PRICE PROMOPRICE PROMOSTART PROMOEND   NAME
---------- ----- ---------- ---------- ---------- ---------------
100-100-01  9.99       7.25 11/19/2004 12/19/2004 Snow Shovel,Bas
100-101-01 19.99      15.99 12/18/2005 02/28/2006 Snow Shovel,Del
100-103-01 49.99      39.99 12/22/2005 02/22/2006 Snow Shovel,Sup
100-201-01  3.99          - -          -          Ice Scraper,Win

4 record(s) selected.

Figure 10.1 The relational data in the product table

10.1 SQL/XML Publishing Functions

In this section we examine the SQL/XML publishing functions in DB2. These functions are also known as “constructor” functions because they construct XML nodes, such as elements and attributes, whose values are taken from relational columns. They are listed in Table 10.1 in the order in which they are introduced in the following sections.

Table 10.1 SQL/XML Publishing Functions

Function         Purpose
XMLELEMENT       Constructs an XML element
XMLCONCAT        Concatenates two values of type XML
XMLFOREST        Constructs a sequence of XML elements
XMLATTRIBUTES    Constructs one or more XML attributes
XMLAGG           Aggregates XML values from multiple rows into a single XML value
XMLROW           Constructs XML elements with default tagging
XMLGROUP         Constructs and aggregates XML elements with default tagging
XMLCOMMENT       Constructs an XML comment
XMLPI            Constructs an XML processing instruction
XMLTEXT          Constructs an XML text node
XMLDOCUMENT      Constructs an XML document node

All of these functions are available in DB2 for z/OS and DB2 for Linux, UNIX, and Windows, except XMLGROUP and XMLROW, which do not exist in DB2 for z/OS. XMLGROUP and XMLROW merely serve as abbreviations for certain combinations of the other functions. Examples in section 10.1.9 show that the XML data constructed by XMLGROUP and XMLROW can also be constructed in DB2 for z/OS with the other publishing functions. All SQL/XML functions belong to the relational schema SYSIBM.

10.1.1 Constructing XML Elements from Relational Data

The most commonly used XML publishing function is XMLELEMENT, which constructs an XML element. In its simplest form, the XMLELEMENT function takes two arguments:

• The name of the XML element that you want to construct.
• An expression that provides the value of the constructed element. This expression is often just the name of a relational column. Later you will also see that the XMLELEMENT function can take multiple and complex expressions as input.

The SELECT statement in Figure 10.2 is a simple example of using the XMLELEMENT function. For each row in the product table it constructs an XML element called pnum that contains the value of the relational column pid. Each of the four constructed elements is a separate well-formed XML document. The return type of the XMLELEMENT function, and therefore of the column produced by the query in Figure 10.2, is XML.

SELECT XMLELEMENT(NAME "pnum", pid)
FROM product;

<pnum>100-100-01</pnum>
<pnum>100-101-01</pnum>
<pnum>100-103-01</pnum>
<pnum>100-201-01</pnum>

4 record(s) selected.

Figure 10.2 Constructing XML elements
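The value argument of XMLELEMENT does not have to be a bare column name. As a minimal sketch, assuming the product table shown in Figure 10.1, the following query computes the element content from an SQL expression. The element name saving is a hypothetical name chosen only for this illustration.

```sql
-- Sketch: the element content comes from an SQL expression, not a plain column.
-- "saving" is an illustrative element name, not one used elsewhere in this chapter.
SELECT XMLELEMENT(NAME "saving", price - promoprice)
FROM product
WHERE pid = '100-100-01';
```

The WHERE clause restricts the query to a row whose promoprice is not NULL, so the example does not depend on how NULL values are handled.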
The query shown in Figure 10.3 is an extension of the query in Figure 10.2. It returns two columns of type XML, each containing a constructed XML element for every row of the product table. The values of the relational columns pid and price are returned as separate XML elements in separate columns. The optional AS clauses give each column a descriptive column name.

SELECT XMLELEMENT(NAME "pnum", pid) AS pnum_elem,
       XMLELEMENT(NAME "cost", price) AS cost_elem
FROM product;

PNUM_ELEM               COST_ELEM
----------------------- ------------------
<pnum>100-100-01</pnum> <cost>9.99</cost>
<pnum>100-101-01</pnum> <cost>19.99</cost>
<pnum>100-103-01</pnum> <cost>49.99</cost>
<pnum>100-201-01</pnum> <cost>3.99</cost>

4 record(s) selected.

Figure 10.3 Constructing XML elements in separate columns

You can use the publishing function XMLCONCAT to combine the two constructed elements into a single column of type XML, as shown in Figure 10.4. Each result row contains a sequence of two XML elements that do not have a common root element and therefore do not form a well-formed document.

SELECT XMLCONCAT(XMLELEMENT(NAME "pnum", pid),
                 XMLELEMENT(NAME "cost", price) ) AS twoelem
FROM product;

TWOELEM
-----------------------------------------
<pnum>100-100-01</pnum><cost>9.99</cost>
<pnum>100-101-01</pnum><cost>19.99</cost>
<pnum>100-103-01</pnum><cost>49.99</cost>
<pnum>100-201-01</pnum><cost>3.99</cost>

4 record(s) selected.

Figure 10.4 Concatenating two XML elements

To produce a well-formed document in each result row, use nested XMLELEMENT functions to construct a common root element. This is easy because one or multiple XMLELEMENT functions can be arguments to another XMLELEMENT function. Figure 10.5 shows how to construct the root element Product, which contains the two elements pnum and cost as child elements.
Note that the nesting of the XMLELEMENT functions corresponds directly to the nesting of the generated XML elements in the result set of the query. The outer XMLELEMENT function constructs the root element Product, which ensures that the generated XML documents are well-formed.
SELECT XMLELEMENT(NAME "Product",
         XMLELEMENT(NAME "pnum", pid),
         XMLELEMENT(NAME "cost", price) ) AS prod_doc
FROM product;

PROD_DOC
------------------------------------------------------------
<Product><pnum>100-100-01</pnum><cost>9.99</cost></Product>
<Product><pnum>100-101-01</pnum><cost>19.99</cost></Product>
<Product><pnum>100-103-01</pnum><cost>49.99</cost></Product>
<Product><pnum>100-201-01</pnum><cost>3.99</cost></Product>

4 record(s) selected.

Figure 10.5 Constructing XML documents with nested elements

If you want to add the promotional price as well as the start and end date of the promotion period to each generated XML document, simply add additional XMLELEMENT functions as arguments to the top-level XMLELEMENT function. This is illustrated in Figure 10.6. Due to the WHERE clause, this query returns just one result row that contains an XML document with values from one of the original relational rows in the product table. The DB2 Command Line Processor (CLP) displays this document as a single wrapping line. For readability, we also show the document with added newline characters and indentation.

SELECT XMLELEMENT(NAME "Product",
         XMLELEMENT(NAME "pnum", pid),
         XMLELEMENT(NAME "cost", price),
         XMLELEMENT(NAME "promoprice", promoprice),
         XMLELEMENT(NAME "start", promostart),
         XMLELEMENT(NAME "end", promoend) )
FROM product
WHERE pid = '100-100-01';

-- Output as a single wrapped line:
<Product><pnum>100-100-01</pnum><cost>9.99</cost><promoprice>7.2
5</promoprice><start>2004-11-19</start><end>2004-12-19</end></Pr
oduct>

-- Output with newline characters and indentation:
<Product>
  <pnum>100-100-01</pnum>
  <cost>9.99</cost>
  <promoprice>7.25</promoprice>
  <start>2004-11-19</start>
  <end>2004-12-19</end>
</Product>

1 record(s) selected.

Figure 10.6 Constructing XML documents with more nested elements

Looking at the query in Figure 10.6, it is easy to realize that constructing larger XML documents can require many nested XMLELEMENT functions. To keep queries short and easy to write, the function XMLFOREST serves as an abbreviation for a sequence of XMLELEMENT functions. The XMLFOREST function takes a list of arguments as input and constructs an XML element for each argument. Each argument is a pair consisting of a relational column name or other expression and the desired element name. The generated XML elements are all siblings of each other. As an example, the query in Figure 10.7 produces the same result as the one in Figure 10.6. However, you will soon see that XMLFOREST and XMLELEMENT have a different default behavior when NULL values are involved (section 10.1.2).

SELECT XMLELEMENT(NAME "Product",
         XMLFOREST(pid AS "pnum",
                   price AS "cost",
                   promoprice AS "promoprice",
                   promostart AS "start",
                   promoend AS "end" ) )
FROM product
WHERE pid = '100-100-01';

Figure 10.7 Constructing XML documents with the XMLFOREST function

The XMLFOREST function can abbreviate the query even further if you are willing to use the relational column names of the source table as default element names. In this case you can omit the custom element names from the XMLFOREST function and only provide a list of relational column names (see Figure 10.8). The default element names produced by XMLFOREST are in uppercase, because uppercase is the default for SQL column names, unless you used lowercase column names in double quotes in the CREATE TABLE statement. Producing default element names based on column names is not possible with the XMLELEMENT function.
SELECT XMLELEMENT(NAME "Product",
         XMLFOREST(pid, price, promoprice, promostart, promoend) )
FROM product
WHERE pid = '100-100-01';

<Product>
  <PID>100-100-01</PID>
  <PRICE>9.99</PRICE>
  <PROMOPRICE>7.25</PROMOPRICE>
  <PROMOSTART>2004-11-19</PROMOSTART>
  <PROMOEND>2004-12-19</PROMOEND>
</Product>

1 record(s) selected.

Figure 10.8 Using the XMLFOREST function with default element names

The document constructed in Figure 10.8 is very flat with only one level of nesting. However, you might be required to generate XML documents that conform to a mandatory target format that involves multiple levels of nesting. Suppose that all the promotion-related information has to be nested under a separate PROMOTION element. Such a document structure is produced in Figure 10.9. Again, note that the nesting of the SQL/XML functions implies the nesting of elements in the generated document. The top-level XMLELEMENT function, which constructs the PRODUCT element, contains one XMLFOREST function and one XMLELEMENT function. The XMLFOREST function constructs the elements PID and PRICE while the XMLELEMENT function generates the element PROMOTION. This XMLELEMENT function includes another XMLFOREST function that produces the child elements PROMOPRICE, PROMOSTART, and PROMOEND.

SELECT XMLELEMENT(NAME "PRODUCT",
         XMLFOREST(pid, price),
         XMLELEMENT(NAME "PROMOTION",
           XMLFOREST(promoprice, promostart, promoend) ) )
FROM product
WHERE pid = '100-100-01';

<PRODUCT>
  <PID>100-100-01</PID>
  <PRICE>9.99</PRICE>
  <PROMOTION>
    <PROMOPRICE>7.25</PROMOPRICE>
    <PROMOSTART>2004-11-19</PROMOSTART>
    <PROMOEND>2004-12-19</PROMOEND>
  </PROMOTION>
</PRODUCT>

1 record(s) selected.

Figure 10.9 Constructing a document with multiple levels of nesting

Note that the XMLFOREST function that contained five column names in Figure 10.8 is broken into two separate XMLFOREST functions in Figure 10.9. The reason is that in Figure 10.9 the elements PROMOPRICE, PROMOSTART, and PROMOEND should be generated at a different level of the document than the elements PID and PRICE. A single XMLFOREST function always produces a sequence of sibling elements for the same level in the document.
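Custom and default element names can also be mixed within one XMLFOREST call. The sketch below assumes the same product table; the element names pnum and saving are illustrative choices, not names required by DB2. An argument that is an expression rather than a plain column name has no column name to fall back on, so it is given an explicit name with AS.

```sql
-- Sketch: one XMLFOREST call that mixes custom and default element names.
SELECT XMLELEMENT(NAME "Product",
         XMLFOREST(pid AS "pnum",                    -- custom element name
                   price,                            -- default name: PRICE
                   price - promoprice AS "saving")   -- expression, named with AS
       )
FROM product
WHERE pid = '100-100-01';
```

All three generated elements are siblings under Product, consistent with the rule that a single XMLFOREST function produces elements for one level of the document.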
10.1.2 NULL Values, Missing Elements, and Empty Elements

In Figure 10.1, which shows the relational data in the product table, you saw that one of the rows contains NULL values in the columns promoprice, promostart, and promoend. If a NULL value is input to an XMLELEMENT function, the default behavior is to generate an empty element. Figure 10.10 shows that the empty element <promoprice/> is constructed where the corresponding cell in the product table is NULL. This behavior is called “Empty on NULL.”

SELECT XMLELEMENT(NAME "Prod",
         XMLELEMENT(NAME "PID", pid),
         XMLELEMENT(NAME "promoprice", promoprice) )
FROM product
WHERE price < 10;

<Prod><PID>100-100-01</PID><promoprice>7.25</promoprice></Prod>
<Prod><PID>100-201-01</PID><promoprice/></Prod>

2 record(s) selected.

Figure 10.10 The “Empty on NULL” behavior of the XMLELEMENT function

Alternatively, you may prefer to omit the <promoprice/> element from the generated document, so that NULL values result in missing elements rather than empty elements. This behavior is called “NULL on NULL” and can be forced by inserting the keywords OPTION NULL ON NULL into the XMLELEMENT function (see Figure 10.11).

SELECT XMLELEMENT(NAME "Prod",
         XMLELEMENT(NAME "PID", pid),
         XMLELEMENT(NAME "promoprice", promoprice
                    OPTION NULL ON NULL ) )
FROM product
WHERE price < 10;

<Prod><PID>100-100-01</PID><promoprice>7.25</promoprice></Prod>
<Prod><PID>100-201-01</PID></Prod>

2 record(s) selected.

Figure 10.11 The “NULL on NULL” option of the XMLELEMENT function

Beware that the default NULL handling of the XMLFOREST function is opposite to the behavior of the XMLELEMENT function! This “mismatch” is defined by the SQL/XML standard and is not an arbitrary choice made by DB2. By default, the XMLFOREST function does not construct elements for NULL values (“NULL on NULL”) but you can specify OPTION EMPTY ON NULL to turn NULLs into empty elements. Figure 10.12 illustrates this behavior.
SELECT XMLELEMENT(NAME "Prod",
         XMLFOREST(pid, promoprice) )
FROM product
WHERE price < 10;

<Prod><PID>100-100-01</PID><PROMOPRICE>7.25</PROMOPRICE></Prod>
<Prod><PID>100-201-01</PID></Prod>

2 record(s) selected.

SELECT XMLELEMENT(NAME "Prod",
         XMLFOREST(pid, promoprice OPTION EMPTY ON NULL) )
FROM product
WHERE price < 10;

<Prod><PID>100-100-01</PID><PROMOPRICE>7.25</PROMOPRICE></Prod>
<Prod><PID>100-201-01</PID><PROMOPRICE/></Prod>

2 record(s) selected.

Figure 10.12 “NULL on NULL” default and “Empty on NULL” option for XMLFOREST

NOTE If your relational data contains many NULL values, the “NULL on NULL” behavior is usually preferable. This option avoids large numbers of empty elements, which reduces the size of the constructed documents and improves the performance of the SQL/XML publishing queries.

10.1.3 Constructing XML Attributes from Relational Data

XML attributes always belong to an XML element and can never appear by themselves. The construction of XML attributes is therefore always combined with the construction of an XML element. The function XMLATTRIBUTES creates one or multiple XML attributes and can only appear as an argument to an XMLELEMENT function.

The arguments for the XMLATTRIBUTES function look just like the arguments for XMLFOREST; that is, one or more relational column names or other expressions. While the XMLFOREST function uses such input to build a sequence of XML elements, the XMLATTRIBUTES function uses this input to construct a sequence of XML attributes. Optionally, each argument can be associated with a desired attribute name. If the attribute names are omitted, the relational column names of the source table are used as default attribute names.
The query in Figure 10.13 constructs an element called Product that contains the attributes pnum and cost, which hold the values of the relational columns pid and price, respectively. The Product element itself is an empty element (denoted by the slash at its end) because the XMLELEMENT function contains no expression to provide element content.

SELECT XMLELEMENT(NAME "Product",
         XMLATTRIBUTES(pid AS "pnum", price AS "cost") )
FROM product;

<Product pnum="100-100-01" cost="9.99"/>
<Product pnum="100-101-01" cost="19.99"/>
<Product pnum="100-103-01" cost="49.99"/>
<Product pnum="100-201-01" cost="3.99"/>

4 record(s) selected.

Figure 10.13 Returning relational values as attributes

For each qualifying row of the product table, the query in Figure 10.14 constructs an XML document that consists of a mix of XML elements and attributes. The nesting of the XML constructor functions implies the nesting of the tags in the generated XML data. The root element PRODUCT contains two attributes, PID and PRICE, whose names default to the names of the referenced relational columns. The PRODUCT element also contains a child element PROMOTION, which in turn contains an attribute PROMOPRICE as well as two child elements generated by the XMLFOREST function. When you specify the children of an XML element, attributes have to come before any child elements. For example, XMLATTRIBUTES(promoprice) is specified before XMLFOREST(promostart, promoend) and it cannot be the other way round.

SELECT XMLELEMENT(NAME "PRODUCT",
         XMLATTRIBUTES(pid, price),
         XMLELEMENT(NAME "PROMOTION",
           XMLATTRIBUTES(promoprice),
           XMLFOREST(promostart, promoend) ) )
FROM product
WHERE pid = '100-100-01';

<PRODUCT PID="100-100-01" PRICE="9.99">
  <PROMOTION PROMOPRICE="7.25">
    <PROMOSTART>2004-11-19</PROMOSTART>
    <PROMOEND>2004-12-19</PROMOEND>
  </PROMOTION>
</PRODUCT>

1 record(s) selected.

Figure 10.14 Constructing a document with mix of elements and attributes
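An element can also carry attributes and text content at the same time. As a minimal sketch, under the assumption that the product table has the name column shown in Figure 10.1, the following query places the XMLATTRIBUTES function first and then supplies the name column as element content, so that the constructed Product element is not empty.

```sql
-- Sketch: attributes plus element content in one XMLELEMENT call.
SELECT XMLELEMENT(NAME "Product",
         XMLATTRIBUTES(pid AS "pnum", price AS "cost"),  -- attributes come first
         name                                            -- element content
       )
FROM product
WHERE pid = '100-100-01';
```

The ordering mirrors Figure 10.14: the attribute constructor precedes every argument that contributes content or child elements.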
NOTE If an argument to XMLATTRIBUTES is NULL, no attribute is constructed for that argument, and there is no option to construct empty attributes for NULL values.

Since an XML element cannot have two attributes with the same name, DB2 rejects any XMLATTRIBUTES function that tries to construct two attributes with identical names (SQL error SQL0242N).

10.1.4 Constructing XML Documents from Multiple Relational Rows

So far all the examples in this chapter have constructed exactly one XML document for each qualifying relational row. In other words, each selected relational row is turned into one corresponding document. However, it is often desirable to combine data from multiple relational rows into a single XML document. For example, consider the relational data in the purchaseorder table and note that there can be multiple orders per customer (see Figure 10.15).

SELECT poid, status, custid, orderdate
FROM purchaseorder;

POID           STATUS    CUSTID         ORDERDATE
-------------- --------- -------------- ----------
          5002 Shipped             1001 02/29/2004
          5000 Unshipped           1002 02/18/2006
          5003 Shipped             1002 02/28/2005
          5006 Shipped             1002 03/01/2006
          5001 Shipped             1003 02/03/2005
          5004 Shipped             1005 11/18/2005

6 record(s) selected.

Figure 10.15 The relational data in the purchaseorder table

Suppose you have to construct one XML document with order information for each customer, such as the documents shown in Figure 10.16. Customer 1001 has one order, but customer 1002 has three orders stored in three different rows of the table. The POID values of these three orders are combined into a single document for customer 1002. This document represents the one-to-many relationship between customers and purchase orders through its hierarchical structure and the repeated occurrence of the child element order.
    <CustOrders cid="1001">
      <order>5002</order>
    </CustOrders>
    <CustOrders cid="1002">
      <order>5000</order>
      <order>5003</order>
      <order>5006</order>
    </CustOrders>
    (...)

Figure 10.16 One document per customer with purchase order information

Producing the documents in Figure 10.16 requires grouping and aggregation of orders by customer. The function XMLAGG achieves this aggregation in Figure 10.17. This query contains the SQL clause GROUP BY custid because the objective is to produce one XML document per customer. For each customer, the query constructs an element CustOrders with an attribute cid that identifies the customer. For each CustOrders element, the XMLAGG function produces a single XML value. This value is a sequence of order elements that represent the orders of the respective customer. Let’s zoom in on this to understand exactly how it works.

    SELECT XMLELEMENT(name "CustOrders",
             XMLATTRIBUTES(custid as "cid"),
             XMLAGG( XMLELEMENT(name "order", poid ) ) )
    FROM purchaseorder
    GROUP BY custid;

Figure 10.17 Using XMLAGG to aggregate order information per customer

XMLAGG is an aggregate function and it behaves much like any other SQL aggregate function, such as SUM, AVG, MIN, MAX, or COUNT. Any such aggregate function, including XMLAGG, takes values from multiple rows as input and produces a single output value. For example, the aggregate function AVG takes numeric values from multiple rows as input and produces a single numeric output value. Similarly, the function MAX also takes values from multiple rows as input and produces a single output value (see Figure 10.18). Correspondingly, XMLAGG takes XML values from multiple rows as input and produces a single XML output value. The single value that is produced by AVG is the arithmetic mean of its input arguments. The single value produced by MAX is the largest value of its input.
And the single value produced by XMLAGG is an XML sequence that combines the XML input arguments. Remember that a value in the XQuery Data Model is always a sequence of zero or more items. In Figure 10.18, XMLAGG takes three sequences as input; each of them contains one XML element. The output of XMLAGG is a single sequence that contains three elements. Just like MAX is an aggregate function for date, string, or numeric values, XMLAGG is an aggregate function for XML type values.
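To see the parallel in a single statement, the following sketch applies MAX and XMLAGG to the same groups. It is an illustration rather than one of the numbered figures. It assumes the purchaseorder table from Figure 10.15, and the column aliases max_poid and order_seq are made up for this example.

```sql
-- Illustrative sketch: MAX and XMLAGG side by side on the same groups.
-- Both are aggregate functions, so each returns one value per custid group.
-- The aliases max_poid and order_seq are hypothetical names for this example.
SELECT custid,
       MAX(poid)                              AS max_poid,  -- one numeric value per group
       XMLAGG(XMLELEMENT(NAME "order", poid)) AS order_seq  -- one XML sequence per group
FROM purchaseorder
GROUP BY custid;
```

With the data in Figure 10.15, the row for customer 1002 should carry 5006 in max_poid and a sequence of three order elements in order_seq. Because this sketch has no ORDER BY clause inside XMLAGG, the order of the elements within the sequence is not guaranteed.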
[Figure 10.18 is a diagram. On one side, MAX reduces the poid values 5000, 5003, and 5006 from three rows to one row with the single value 5006. On the other side, XMLELEMENT turns the same three values into the elements <order>5000</order>, <order>5003</order>, and <order>5006</order>, and XMLAGG combines these three rows into one row with a single value, the sequence of all three order elements.]

Figure 10.18 The aggregation functions MAX and XMLAGG

There are further ways in which XMLAGG behaves like other SQL aggregate functions. The SQL query in Figure 10.19(a) produces one value per group; that is, the maximum poid for each distinct custid value. Because custid appears in the SELECT clause and is not an argument to an aggregate function, it must also appear in the GROUP BY clause. Without the GROUP BY clause, as in Figure 10.19(b), the entire input table is one group and the aggregate function MAX is applied to the entire table. Hence, the maximum poid value across all rows is returned. In this case, custid must not appear in the SELECT list. All of these characteristics apply similarly to the XMLAGG function. The query in Figure 10.17 corresponds to the query in Figure 10.19(a) because the custid column appears in the SELECT clause but is not an argument of XMLAGG. Hence, custid must also be specified in the GROUP BY clause and the query produces one document per custid. The query in Figure 10.20 corresponds to Figure 10.19(b), because without a GROUP BY clause XMLAGG aggregates order elements based on all rows in the table and returns a single large document.

    (a)
    SELECT custid, MAX(poid) AS max
    FROM purchaseorder
    GROUP BY custid;

    CUSTID  MAX
    ------  ----
    1001    5002
    1002    5006
    1003    5001
    1005    5004

      4 record(s) selected.

    (b)
    SELECT MAX(poid) AS max
    FROM purchaseorder;

    MAX
    ----
    5006

      1 record(s) selected.

Figure 10.19 Aggregation with and without grouping
    SELECT XMLELEMENT(name "CustOrders",
             XMLAGG(XMLELEMENT(name "order", poid )) )
    FROM purchaseorder;

    <CustOrders>
      <order>5000</order>
      <order>5001</order>
      <order>5002</order>
      <order>5003</order>
      <order>5004</order>
      <order>5006</order>
    </CustOrders>

      1 record(s) selected.

Figure 10.20 Aggregation without grouping produces a single row

The query in Figure 10.21 is an extension of the one in Figure 10.17. The order elements that are aggregated per customer are now ordered by the orderdate column in the purchaseorder table. This affects the order in which the order elements appear as child elements within each CustOrders element. The query in Figure 10.21 also includes date attributes in the aggregated order elements, although this is not a requirement for sorting them by orderdate. The order elements can also contain nested child elements; that is, they can be the root of larger XML fragments that describe each order.

    SELECT XMLELEMENT(name "CustOrders",
             XMLATTRIBUTES(custid as "cid"),
             XMLAGG( XMLELEMENT(name "order",
                       XMLATTRIBUTES(orderdate AS "date"),
                       poid )
                     ORDER BY orderdate) )
    FROM purchaseorder
    WHERE custid IN (1001,1002)
    GROUP BY custid;

    <CustOrders cid="1001">
      <order date="2004-02-29">5002</order>
    </CustOrders>
    <CustOrders cid="1002">
      <order date="2005-02-28">5003</order>
      <order date="2006-02-18">5000</order>
      <order date="2006-03-01">5006</order>
    </CustOrders>

      2 record(s) selected.

Figure 10.21 XMLAGG with ORDER BY clause
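The sort key inside XMLAGG does not have to appear in the constructed elements, and the ORDER BY clause accepts more than one key as well as the usual ASC and DESC keywords. The following variation is a sketch under those assumptions, not one of the numbered figures. It lists each customer's orders from newest to oldest and uses poid as a tiebreaker for orders placed on the same date.

```sql
-- Illustrative sketch: newest orders first, with poid as a secondary sort key.
-- orderdate is used only for sorting; it is not part of the constructed XML.
SELECT XMLELEMENT(NAME "CustOrders",
         XMLATTRIBUTES(custid AS "cid"),
         XMLAGG(
           XMLELEMENT(NAME "order", poid)
           ORDER BY orderdate DESC, poid ) )
FROM purchaseorder
WHERE custid IN (1001, 1002)
GROUP BY custid;
```

With the data in Figure 10.15, the order elements for customer 1002 would be expected in the sequence 5006, 5000, 5003, which is the reverse of the sequence in Figure 10.21.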
10.1.5 Constructing XML Documents from Multiple Relational Tables

The previous section has described the construction of one XML document per customer. Each document contains information for each of the customers’ orders. The XMLAGG function was used to effectively convert the one-to-many relationship between customers and purchase orders into nested repeating elements in the constructed documents. To further extend this scenario, recall that there is also a one-to-many relationship between purchase orders and products (see section 9.2, Join Queries with XML Data). Each order contains multiple products (items). This permits the construction of more detailed documents that not only have repeating order elements for each customer but also repeating item elements for each order.

Figure 10.22 shows an example of a document that you might want to construct. It contains information about all orders that customer 1002 has placed. For each order, the purchase order identifier from the relational column poid is provided as attribute oid, and the column orderdate as child element date. Additionally, the identifiers of all products in a purchase order are listed as item elements. Each item element also contains the promotional product price as an attribute, if applicable.

    <CustOrders cid="1002">
      <order oid="5000">
        <date>2006-02-18</date>
        <item promprice="7.25">100-100-01</item>
        <item promprice="39.99">100-103-01</item>
      </order>
      <order oid="5003">
        <date>2005-02-28</date>
        <item promprice="7.25">100-100-01</item>
      </order>
      <order oid="5006">
        <date>2006-03-01</date>
        <item promprice="7.25">100-100-01</item>
        <item promprice="15.99">100-101-01</item>
        <item>100-201-01</item>
      </order>
    </CustOrders>

Figure 10.22 Document for customer 1002, with all related orders and all items per order

Note the two levels of nested and repeating elements in Figure 10.22.
The root element CustOrders contains a variable number of order elements, and each order element contains a variable number of item elements. Two XMLAGG functions are required to construct this document structure, as shown in Figure 10.23. The first and “outer” XMLAGG function in Figure 10.23 aggregates order elements per customer. The second and “inner” XMLAGG function aggregates item elements per order. Note that the XMLELEMENT function that constructs the order element contains four arguments:

• The constant “order” to specify the element name.
• An XMLATTRIBUTES function to construct the oid attribute.
• An XMLELEMENT function to construct the child element date.
• An expression that produces a single value of type XML to add further child elements. This expression is a scalar subselect, which is a SELECT statement that produces exactly one value (one row in one column).

The subselect is a query against the product table and retrieves the columns pid and promoprice. The WHERE clause of the subquery contains a join predicate to only read those rows from the product table that match the items in a given purchase order. The join condition is expressed between the XML column PORDER of the purchaseorder table and the PID column of the product table. If there were relational join keys then this could be a regular relational join predicate. The subselect constructs an item element for each product in a given purchase order and then uses XMLAGG to aggregate these item elements into a single sequence. This sequence is a single value of type XML, which ensures that the subselect is indeed a scalar subselect. Since the subselect is an input argument to an XMLELEMENT function, it must produce a single value.

    SELECT XMLELEMENT(name "CustOrders",
             XMLATTRIBUTES(po.custid as "cid"),
             XMLAGG(
               XMLELEMENT(name "order",
                 XMLATTRIBUTES(po.poid as "oid"),
                 XMLELEMENT(name "date", po.orderdate),
                 (SELECT XMLAGG(
                           XMLELEMENT(name "item",
                             XMLATTRIBUTES(promoprice as "promprice"),
                             pr.pid) )
                  FROM product pr
                  WHERE XMLEXISTS('$PORDER/PurchaseOrder/item[partid = $PID]'))
               )
             )
           )
    FROM purchaseorder po
    WHERE po.custid = 1002
    GROUP BY po.custid;

Figure 10.23 Constructing XML based on two relational tables

The query in Figure 10.23 constructs the document format in Figure 10.22 by nesting XML construction functions and subqueries according to the nested document structure that is to be produced. If the XML construction at each level of nesting becomes more complex, another way of writing the same query can sometimes be helpful.
The query in Figure 10.24 also produces the document in Figure 10.22 but uses common table expressions for a somewhat more modular approach to constructing XML data. The WITH clause defines one or multiple common table expressions. Each such expression is a subquery whose result set can be referenced based on assigned table and column names. You can also think of common table expressions as view definitions that can only be referenced in this query.
Common table expressions are a useful query pattern to construct XML data. The query in Figure 10.24 uses one common table expression for each collection of nested elements, starting with the deepest level of the target document. The table expression items(pid, itemxml) is defined by a subselect that constructs the item elements that you see in Figure 10.22. Later parts of the query can obtain the item elements from the column itemxml and don’t need to be concerned with how they were constructed. In the table expression items, each constructed item element is paired with a pid to ease the selection of required items.

The second table expression in Figure 10.24, orders(custid, orderxml), constructs the order elements and exposes them through the column orderxml. The order elements have to include the appropriate item elements, which are selected from the previously defined table expression items. No matter how complex the item elements are, the expression XMLAGG(i.itemxml) combines them into a sequence. The join predicate ensures that only those items are selected from the table expression items that belong to the current order.

The final construction of the CustOrders element is now much simpler in Figure 10.24 than in Figure 10.23. Since the common table expressions items and orders have already constructed the inner repeated elements item and order, the outermost SELECT clause of the query only has to construct the element CustOrders, add any attributes, and use the expression XMLAGG(o.orderxml) to aggregate all the order elements for the respective customer.
    WITH items (pid, itemxml) AS
           (SELECT pid,
                   XMLELEMENT(name "item",
                     XMLATTRIBUTES(promoprice as "promprice"),
                     pid)
            FROM product),
         orders (custid, orderxml) AS
           (SELECT custid,
                   XMLELEMENT(name "order",
                     XMLATTRIBUTES(po.poid as "oid"),
                     XMLELEMENT(name "date", po.orderdate),
                     (SELECT XMLAGG(i.itemxml)
                      FROM items i
                      WHERE XMLEXISTS('$PORDER/PurchaseOrder/item[partid=$PID]') ) )
            FROM purchaseorder po )
    SELECT XMLELEMENT(name "CustOrders",
             XMLATTRIBUTES(o.custid as "cid"),
             XMLAGG(o.orderxml))
    FROM orders o
    WHERE o.custid = 1002
    GROUP BY o.custid;

Figure 10.24 XML construction with common table expressions
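Section 10.1.5 mentions that the join between orders and products could be a regular relational join predicate if relational join keys existed. The following sketch shows what that might look like. It rests on an assumption that is not part of the sample database: a hypothetical table lineitem(poid, pid) that stores one row for each product in each purchase order. With such a table, the XMLEXISTS predicate is replaced by ordinary equality predicates.

```sql
-- Illustrative sketch only.
-- ASSUMPTION: lineitem(poid, pid) is a hypothetical relational table with
-- one row per product per purchase order. It does not exist in the sample
-- tables used in this chapter.
SELECT XMLELEMENT(NAME "order",
         XMLATTRIBUTES(po.poid AS "oid"),
         XMLELEMENT(NAME "date", po.orderdate),
         -- scalar subselect: XMLAGG collapses all matching items into one XML value
         (SELECT XMLAGG(
                   XMLELEMENT(NAME "item",
                     XMLATTRIBUTES(pr.promoprice AS "promprice"),
                     pr.pid) )
          FROM lineitem li
               JOIN product pr ON pr.pid = li.pid
          WHERE li.poid = po.poid) )
FROM purchaseorder po
WHERE po.custid = 1002;
```

The structure of the construction is unchanged. Only the join condition in the subselect differs, which is why the choice between an XML join predicate and a relational one can be made independently of the target document format.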
10.1.6 Comparing XMLAGG, XMLCONCAT, and XMLFOREST

So far in this chapter you have seen examples that included the functions XMLAGG, XMLCONCAT, and XMLFOREST. Among these three functions, XMLFOREST is the only one that constructs new XML elements from relational input data. XMLAGG and XMLCONCAT both work with XML values that have already been constructed by other functions. XMLAGG is the only function that combines XML data from multiple rows into a single XML value in a single row. On the other hand, XMLCONCAT and XMLFOREST do not aggregate and do not directly affect the cardinality of a query result set. The differences and commonalities between these functions are summarized in Table 10.2.

Table 10.2 Characteristics of XMLAGG, XMLCONCAT, and XMLFOREST

                                                                          XMLAGG  XMLCONCAT  XMLFOREST
    Constructs new XML elements                                           No      No         Yes
    Concatenates input elements from two or more columns                  No      Yes        No
    Can have two or more arguments                                        No      Yes        Yes
    Aggregates a set of XML values                                        Yes     No         No
    Combines XML from multiple rows into a single XML value in one row    Yes     No         No
    Input argument(s) must be of type XML                                 Yes     Yes        No
    Is an abbreviation for multiple XMLELEMENT functions                  No      No         Yes

10.1.7 Conditional Element Construction

It is possible to construct XML elements or attributes based on conditions. Both tag names and values can depend on conditions. For example, the query in Figure 10.25 uses an SQL CASE expression to construct the element ShipPriority if the status of an order is “Unshipped”. Otherwise it constructs the element Status. The value of the element ShipPriority is determined by another CASE expression. If the order is more than 14 days old, the value of the element ShipPriority is "high", otherwise it is "low".
    SELECT XMLELEMENT(name "Order",
             XMLATTRIBUTES(poid as "id"),
             XMLELEMENT(name "Customer", custid),
             CASE WHEN status='Unshipped'
                  THEN XMLELEMENT(name "ShipPriority",
                         CASE WHEN orderdate < current_date - 14 days
                              THEN 'high'
                              ELSE 'low' END)
                  ELSE XMLELEMENT(name "Status", status)
             END)
    FROM purchaseorder
    WHERE orderdate > '02/15/2006';

    <Order id="5000">
      <Customer>1002</Customer>
      <ShipPriority>high</ShipPriority>
    </Order>
    <Order id="5006">
      <Customer>1002</Customer>
      <Status>Shipped</Status>
    </Order>

      2 record(s) selected.

Figure 10.25 Conditional element construction with a CASE expression

10.1.8 Leading Zeros in Constructed Elements and Attributes

In DB2 Version 8, DB2 9.1, and DB2 9.5 for Linux, UNIX, and Windows, constructing XML elements or attributes from relational DECIMAL or DOUBLE values introduces leading zeros in the constructed XML node. For example, the function XMLELEMENT(NAME "cost", price) produces elements such as this:

    <cost>0000000000000000000000000009.99</cost>

The same applies to the XMLFOREST and XMLATTRIBUTES functions. These leading zeros are not generated in DB2 for z/OS and DB2 9.7 for Linux, UNIX, and Windows. In prior versions of DB2 for Linux, UNIX, and Windows, the zeros can be avoided by casting the numeric input values to type XML. The first part of Figure 10.26 shows the functions XMLELEMENT, XMLATTRIBUTES, and XMLFOREST that each take the DECIMAL column price as input. To avoid the leading zeros, add the XMLCAST function as shown in the second part of Figure 10.26. Casting to data type XML avoids the leading zeros, but the functions XMLFOREST and XMLATTRIBUTES do not accept arguments of type XML. Therefore, a second XMLCAST function is required to convert the number without the leading zeros to a character data type.

    -- DB2 for z/OS and DB2 9.7 and higher:
    XMLELEMENT(NAME "cost", price)
    XMLATTRIBUTES(price AS "COST")
    XMLFOREST(pid, price)

    -- DB2 9.5 and earlier:
    XMLELEMENT(NAME "cost", XMLCAST(price AS XML))
    XMLATTRIBUTES(XMLCAST(XMLCAST(price AS XML) AS VARCHAR(50)) AS "COST")
    XMLFOREST(pid, XMLCAST(XMLCAST(price AS XML) AS VARCHAR(50)) AS PRICE)

Figure 10.26 Leading zeros in numeric output prior to DB2 9.7 for Linux, UNIX, and Windows

10.1.9 Default Tagging of Relational Data with XMLROW and XMLGROUP

In addition to the SQL/XML publishing functions discussed so far, DB2 for Linux, UNIX, and Windows offers the functions XMLROW and XMLGROUP. These functions do not provide any new capabilities but merely act as convenient abbreviations for combinations of the functions XMLELEMENT, XMLATTRIBUTES, XMLFOREST, and XMLAGG. In particular, XMLROW and XMLGROUP are simple to use because they construct XML with a default structure and default tag names. Let’s look at a few examples based on the relational data in Figure 10.27.

    POID   STATUS     CUSTID
    -----  ---------  ------
    5000   Unshipped  1002
    5001   Shipped    1003
    5002   Shipped    1001

Figure 10.27 A subset of the purchaseorder table

Figure 10.28 through Figure 10.33 show six queries with XMLROW and XMLGROUP. Each figure contains a second query that produces the same result. This comparison clarifies how XMLROW and XMLGROUP are merely shortcuts for other SQL/XML functions. Each figure also shows the constructed XML data based on the relational input data in Figure 10.27.

The query in Figure 10.28 shows that XMLROW converts each row of the input table into an XML element <row> that has child elements for each of the selected columns. The column names are used as default element names. In DB2 for z/OS, where the function XMLROW is not available, you can use XMLELEMENT plus XMLFOREST instead. Optionally, XMLROW can produce attributes instead of child elements for the selected columns, as shown in Figure 10.29.
The function XMLROW always generates one XML document in one result row for each qualifying input row. All the queries shown here can certainly have WHERE clauses to restrict the result sets. The function XMLGROUP differs from XMLROW in the cardinality of the produced result sets.
In particular, XMLGROUP is an abbreviation for XMLAGG plus XMLELEMENT and XMLFOREST and combines data from multiple or all input rows into one XML document. Figure 10.30 shows an example.

    SELECT XMLROW(poid, status, custid)
    FROM purchaseorder;

    SELECT XMLELEMENT(name "row",
             XMLFOREST(poid, status, custid) )
    FROM purchaseorder;

    <row>
      <POID>5000</POID>
      <STATUS>Unshipped</STATUS>
      <CUSTID>1002</CUSTID>
    </row>
    <row>
      <POID>5001</POID>
      <STATUS>Shipped</STATUS>
      <CUSTID>1003</CUSTID>
    </row>
    <row>
      <POID>5002</POID>
      <STATUS>Shipped</STATUS>
      <CUSTID>1001</CUSTID>
    </row>

      3 record(s) selected.

Figure 10.28 Default tagging with XMLROW

    SELECT XMLROW(poid, status, custid OPTION AS ATTRIBUTES)
    FROM purchaseorder;

    SELECT XMLELEMENT(name "row",
             XMLATTRIBUTES(poid, status, custid) )
    FROM purchaseorder;

    <row POID="5000" STATUS="Unshipped" CUSTID="1002"/>
    <row POID="5001" STATUS="Shipped" CUSTID="1003"/>
    <row POID="5002" STATUS="Shipped" CUSTID="1001"/>

      3 record(s) selected.

Figure 10.29 Default tagging with XMLROW, using attributes
    SELECT XMLGROUP(poid, status, custid)
    FROM purchaseorder;

    SELECT XMLELEMENT(name "rowset",
             XMLAGG(
               XMLELEMENT(name "row",
                 XMLFOREST(poid, status, custid))))
    FROM purchaseorder;

    <rowset>
      <row>
        <POID>5000</POID>
        <STATUS>Unshipped</STATUS>
        <CUSTID>1002</CUSTID>
      </row>
      <row>
        <POID>5001</POID>
        <STATUS>Shipped</STATUS>
        <CUSTID>1003</CUSTID>
      </row>
      <row>
        <POID>5002</POID>
        <STATUS>Shipped</STATUS>
        <CUSTID>1001</CUSTID>
      </row>
    </rowset>

      1 record(s) selected.

Figure 10.30 Default tagging with XMLGROUP

Just like XMLROW, the function XMLGROUP also has an option to produce attributes instead of elements (see Figure 10.31).

    SELECT XMLGROUP(poid, status, custid OPTION AS ATTRIBUTES)
    FROM purchaseorder;

    SELECT XMLELEMENT(name "rowset",
             XMLAGG(
               XMLELEMENT(name "row",
                 XMLATTRIBUTES(poid, status, custid))))
    FROM purchaseorder;

    <rowset>
      <row POID="5000" STATUS="Unshipped" CUSTID="1002"/>
      <row POID="5001" STATUS="Shipped" CUSTID="1003"/>
      <row POID="5002" STATUS="Shipped" CUSTID="1001"/>
    </rowset>

      1 record(s) selected.

Figure 10.31 Default tagging with XMLGROUP, using attributes

If you use a GROUP BY clause, the function XMLGROUP behaves just like the XMLAGG function. One XML document is constructed for each group. The query in Figure 10.32 groups the result by the status column, which contains the values Unshipped and Shipped. Therefore two documents are generated: one that contains shipped orders and one with unshipped orders.
    SELECT XMLGROUP(poid, status, custid OPTION AS ATTRIBUTES)
    FROM purchaseorder
    GROUP BY status;

    SELECT XMLELEMENT(name "rowset",
             XMLAGG(
               XMLELEMENT(name "row",
                 XMLATTRIBUTES(poid, status, custid))))
    FROM purchaseorder
    GROUP BY status;

    <rowset>
      <row POID="5000" STATUS="Unshipped" CUSTID="1002"/>
    </rowset>
    <rowset>
      <row POID="5001" STATUS="Shipped" CUSTID="1003"/>
      <row POID="5002" STATUS="Shipped" CUSTID="1001"/>
    </rowset>

      2 record(s) selected.

Figure 10.32 Default tagging and grouping with XMLGROUP

Both XMLGROUP and XMLROW have options that allow you to change the element names row and rowset to custom names. The query in Figure 10.33 uses porder and Orders instead.

    SELECT XMLGROUP(poid, status, custid
             OPTION AS ATTRIBUTES ROW "porder" ROOT "Orders")
    FROM purchaseorder
    GROUP BY status;

    SELECT XMLELEMENT(name "Orders",
             XMLAGG(
               XMLELEMENT(name "porder",
                 XMLATTRIBUTES(poid, status, custid))))
    FROM purchaseorder
    GROUP BY status;

    <Orders>
      <porder POID="5000" STATUS="Unshipped" CUSTID="1002"/>
    </Orders>
    <Orders>
      <porder POID="5001" STATUS="Shipped" CUSTID="1003"/>
      <porder POID="5002" STATUS="Shipped" CUSTID="1001"/>
    </Orders>

      2 record(s) selected.

Figure 10.33 XMLGROUP with options for non-default tag names

10.1.10 GUI-Based Definition of SQL/XML Publishing Queries

Manual coding of SQL/XML publishing queries can become a complex task if you need to generate complex XML documents. IBM InfoSphere Data Architect, previously known as Rational Data Architect (RDA), provides relief with a graphical user interface that lets you define the mapping from relational source tables to a target XML format, as shown in Figure 10.34. In the background, RDA generates an SQL/XML publishing statement that implements the desired mapping. At the time of writing, this feature was not available in IBM Data Studio Developer.
Figure 10.34 Relational to XML mapping in InfoSphere Data Architect (RDA)

10.1.11 Constructing Comments, Processing Instructions, and Text Nodes

The SQL/XML functions XMLCOMMENT, XMLPI, and XMLTEXT are available to construct comment nodes, processing instruction nodes, and text nodes, respectively. Most applications that construct XML documents do not need to use these functions. Please refer to the DB2 Information Center or the latest DB2 SQL Reference if you require more details on these functions.

10.1.12 Legacy Functions

The legacy functions XML2CLOB and REC2XML are not part of the SQL/XML standard and have been superseded by SQL/XML standard functions. The function XML2CLOB was introduced in DB2 V8 to convert constructed XML data from type XML to type CLOB. The SQL/XML function XMLSERIALIZE supersedes the XML2CLOB function. It can convert XML type data to CLOB, BLOB, VARCHAR, or CHAR, and can optionally also control the generation of XML declarations (see section 10.3). The function XML2CLOB is supported in DB2 9.x for backward compatibility only. It is recommended to use XMLSERIALIZE instead of XML2CLOB.

REC2XML is another legacy function that only exists for backward compatibility. It was introduced to allow queries to retrieve relational rows in a default XML format. It is recommended that you use the new functions XMLROW, XMLELEMENT, or XMLATTRIBUTES instead of REC2XML.

10.2 USING XQUERY CONSTRUCTORS WITH RELATIONAL INPUT

XQuery element and attribute constructors were introduced in section 8.4, Constructing XML Data. They allow you to construct new XML elements and attributes and to nest them to build new documents. In section 8.4 the constructed XML data contains values that are extracted from
other XML documents. In this section we show how XQuery element and attribute constructors can also use values from relational columns. Consider Figure 10.35 as an example. Since XQuery element and attribute constructors are XQuery expressions, just like XPath or FLWOR expressions, you can enclose them in an XMLQUERY function in order to include them in an SQL statement.

Direct element and attribute constructors allow you to simply type the tags of XML documents that you want to construct from each input row. Wherever you want a relational column to provide an attribute or element value, simply use the column name as an uppercase variable and enclose it in curly brackets. This turns the column name into an expression that is evaluated when the query executes. In Figure 10.35, $POID, $ORDERDATE, and $STATUS refer to the relational columns poid, orderdate, and status. The curly brackets ensure that the constructed elements and attributes contain the column values, and not the column names.

    SELECT XMLQUERY('<order id="{$POID}">
                       <details>
                         <date>{$ORDERDATE}</date>
                         <status>{$STATUS}</status>
                       </details>
                     </order>')
    FROM purchaseorder
    WHERE poid = 5000;

    <order id="5000">
      <details>
        <date>2006-02-18</date>
        <status>Unshipped</status>
      </details>
    </order>

      1 record(s) selected.

Figure 10.35 Using element constructors in a query

XQuery element and attribute constructors are not available in DB2 9 for z/OS, but you can achieve the same construction with the SQL/XML publishing functions. For example, the query in Figure 10.36 runs on all platforms and produces the same result as the query in Figure 10.35.

    SELECT XMLELEMENT(name "order",
             XMLATTRIBUTES(poid AS "id"),
             XMLELEMENT(name "details",
               XMLFOREST(orderdate AS "date", status AS "status")))
    FROM purchaseorder
    WHERE poid = 5000;

Figure 10.36 SQL/XML publishing functions that produce the same result as Figure 10.35
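The query in Figure 10.35 relies on column names being available as uppercase XQuery variables. As an alternative sketch for DB2 for Linux, UNIX, and Windows, the PASSING clause of XMLQUERY can name the variables explicitly, which lets the variable names differ from the column names. The variable names id and st below are made up for this illustration.

```sql
-- Illustrative sketch: explicit PASSING clause instead of implicit column variables.
-- The XQuery variable names "id" and "st" are hypothetical; they are
-- case-sensitive and must match the $id and $st references in the expression.
SELECT XMLQUERY('<order id="{$id}">
                   <status>{$st}</status>
                 </order>'
                PASSING poid AS "id", status AS "st")
FROM purchaseorder
WHERE poid = 5000;
```

Explicit passing can make longer constructor expressions easier to read, because each variable is declared next to the SQL expression that supplies its value.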
The XQuery element and attribute constructors can be combined with the SQL/XML publishing functions, which is useful if you want to construct documents that contain values from multiple rows. Such aggregation is best done with the SQL/XML function XMLAGG. However, the XML fragments that are being aggregated can be generated with XQuery constructor expressions, as shown in Figure 10.37. The XMLQUERY function contains XQuery element and attribute constructors to produce the order elements. Since the result type of the XMLQUERY function is always XML, it produces a valid input type for the XMLAGG function.

    SELECT XMLELEMENT(name "CustOrders",
             XMLAGG(
               XMLQUERY('<order cid="{$CUSTID}" date="{$ORDERDATE}">
                           {$POID}
                         </order>') ))
    FROM purchaseorder
    WHERE custid IN (1001,1002)
    GROUP BY custid;

    <CustOrders>
      <order cid="1001" date="2004-02-29">5002</order>
    </CustOrders>
    <CustOrders>
      <order cid="1002" date="2005-02-28">5003</order>
      <order cid="1002" date="2006-02-18">5000</order>
      <order cid="1002" date="2006-03-01">5006</order>
    </CustOrders>

      2 record(s) selected.

Figure 10.37 SQL/XML publishing functions and XQuery constructor expressions

10.3 XML DECLARATIONS FOR CONSTRUCTED XML DATA

When you retrieve XML data from DB2, either from XML columns or constructed from relational columns, you might want each document to have an XML declaration with an encoding attribute, such as

    <?xml version="1.0" encoding="UTF-8"?>

The generation of XML declarations is controlled by the application that interacts with DB2. Chapter 20, Understanding XML Data Encoding, and Chapter 21, Developing XML Applications with DB2, provide further details.

Be aware that the DB2 Command Line Processor (CLP) is an application that by default retrieves XML data without XML declarations. This default behavior can be changed. If you invoke the
CLP with the -d option, such as db2 -t -d, an XML declaration is added to each document that you retrieve. During retrieval, DB2 also converts the XML data to the code page of the application, which can depend on your operating system. On AIX 5.3, constructed XML data retrieved via the CLP may carry the XML declarations shown in Figure 10.38. Since each XML element returned in Figure 10.38 is a separate XML document (one per row), each result row has its own XML declaration.

    SELECT XMLELEMENT(NAME "pnum", pid) AS pnum_elem
    FROM product;

    PNUM_ELEM
    -------------------------------------------------------------
    <?xml version="1.0" encoding="UTF-8"?><pnum>100-100-01</pnum>
    <?xml version="1.0" encoding="UTF-8"?><pnum>100-101-01</pnum>
    <?xml version="1.0" encoding="UTF-8"?><pnum>100-103-01</pnum>
    <?xml version="1.0" encoding="UTF-8"?><pnum>100-201-01</pnum>

      4 record(s) selected.

Figure 10.38 Constructed XML document with XML declarations

If you use SPUFI to run the same query on DB2 for z/OS, the result set may look like the one shown in Figure 10.39.

    <?xml version="1.0" encoding="IBM285"?><pnum>100-101-01</pnum>
    <?xml version="1.0" encoding="IBM285"?><pnum>100-100-01</pnum>
    <?xml version="1.0" encoding="IBM285"?><pnum>100-103-01</pnum>
    <?xml version="1.0" encoding="IBM285"?><pnum>100-201-01</pnum>
    DSNE610I NUMBER OF ROWS DISPLAYED IS 4

Figure 10.39 Result set with XML declarations in SPUFI

For ODBC and embedded SQL applications, DB2 for z/OS adds an XML declaration to the returned XML data by default. For Java and .NET applications, the generation of an XML declaration depends on the methods used to retrieve the data (see Chapter 21). Independent of the API and the platform that DB2 is running on, you can always control (include/exclude) the generation of XML declarations with the XMLSERIALIZE function.
To do so, wrap the XMLSERIALIZE function around the XML type column that the query produces and use the keywords EXCLUDING XMLDECLARATION or INCLUDING XMLDECLARATION as needed (see Figure 10.40). The XMLSERIALIZE function also changes the return type of the constructed XML data from type XML to a character or binary type. In DB2 for Linux, UNIX, and Windows the target types of the XMLSERIALIZE function can be CLOB, BLOB, CHAR, and VARCHAR. DB2 for z/OS allows types CLOB and BLOB, and CLOB can further be cast to VARCHAR if the size allows.
    SELECT XMLSERIALIZE(XMLELEMENT(NAME "pnum", pid)
                        AS CLOB(500) EXCLUDING XMLDECLARATION)
    FROM product;

    SELECT XMLSERIALIZE(XMLELEMENT(NAME "pnum", pid)
                        AS CLOB(500) INCLUDING XMLDECLARATION)
    FROM product;

Figure 10.40 Using XMLSERIALIZE to suppress or include XML declarations

10.4 INSERTING CONSTRUCTED XML DATA INTO XML COLUMNS

The construction of XML data in DB2 composes an XML document tree that is internally represented in DB2’s parsed hierarchical XML format. All examples that we discussed in this chapter so far return the constructed XML data to the application that issued the query. When XML data is transferred from the DB2 server to a client application, the XML data is implicitly serialized; that is, converted to its textual representation. You can also perform explicit serialization with the XMLSERIALIZE function to choose a return type such as CLOB or VARCHAR and to control the generation of an XML declaration. Either way, the XML data is sent to the application as text.

Instead of serializing a constructed document tree to text and returning it to the client, the document tree can also be inserted into an XML column. In section 3.1, Understanding XML Document Trees, we explained that the XQuery Data Model requires a document tree to have a document node. The document node is the parent of the root element. A document node is not visible in the textual representation of an XML document, and not automatically generated when you construct XML data that is returned in text format to the application. However, a document node must be added if you want to insert a constructed document into an XML column. The SQL/XML function XMLDOCUMENT constructs such a document node.

Suppose you want to insert constructed XML documents into the following table:

    CREATE TABLE orders(orderinfo XML)

Figure 10.41 shows the usage of the XMLDOCUMENT function as the outermost function for the construction of XML data.
If you omit the XMLDOCUMENT function, the INSERT statement fails with error SQL20345N The XML value is not a well-formed document with a single root.

    INSERT INTO orders(orderinfo)
    SELECT XMLDOCUMENT(
             XMLELEMENT(name "order",
               XMLATTRIBUTES(poid AS "id"),
               XMLFOREST(orderdate, status)) )
    FROM purchaseorder;

Figure 10.41 Insert requires construction of an XML document node
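
To check what an INSERT like the one in Figure 10.41 stored, the documents can be read back as text. The following is a minimal sketch that assumes the orders table defined above; the CLOB length of 1000 is an arbitrary size chosen for illustration, not a recommendation.

```sql
-- Read back the documents stored in the orders table.
-- CLOB(1000) is an assumed size; pick one that fits your documents.
SELECT XMLSERIALIZE(orderinfo AS CLOB(1000) EXCLUDING XMLDECLARATION)
FROM orders;
```

Each returned row should contain one serialized order element. The document node added by XMLDOCUMENT does not show up in the text, because a document node has no textual representation.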
A document node is equally required if you use XQuery direct element and attribute constructors instead of the SQL/XML publishing functions. Figure 10.42 shows the insertion of a document that is constructed in XQuery, with the document node added by the XMLDOCUMENT function.

    INSERT INTO orders(orderinfo)
    SELECT XMLDOCUMENT(
             XMLQUERY('<order id="{$POID}">
                         <date>{$ORDERDATE}</date>
                         <status>{$STATUS}</status>
                       </order>') )
    FROM purchaseorder;

Figure 10.42 XMLDOCUMENT plus XQuery constructors

Remember that XMLDOCUMENT is an SQL function. The XQuery language includes a corresponding document node constructor, document{ }, which you see used in Figure 10.43.

    INSERT INTO orders(orderinfo)
    SELECT XMLQUERY('document{<order id="{$POID}">
                                <date>{$ORDERDATE}</date>
                                <status>{$STATUS}</status>
                              </order>}')
    FROM purchaseorder;

Figure 10.43 Constructing a document node in XQuery

10.5 SUMMARY

Developing and working with XML applications is not only about consuming and processing XML data, but often also about creating and publishing XML data. In particular, generating XML documents from existing data in relational tables is a common requirement. Constructing XML data is supported through SQL/XML functions since version 8 of both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. These SQL/XML publishing functions, also sometimes called constructor functions, can be used in the SELECT clause of any valid SQL query. They take the columns of a relational result set as input and produce XML data as output. For example, the XMLELEMENT function constructs XML elements, the XMLATTRIBUTES function constructs XML attributes, and the XMLAGG function aggregates XML elements from multiple rows into XML documents. Remember that XMLAGG works much like any other SQL aggregation function, taking multiple rows as input and producing a single row as output.
Just as XML elements are nested in the tree structure of an XML document, it is common to nest the SQL/XML publishing functions to construct a correspondingly nested document structure. IBM InfoSphere Data Architect also provides a graphical interface to create SQL/XML publishing queries visually.
In addition to the SQL/XML publishing functions, DB2 for Linux, UNIX, and Windows also supports XQuery direct element and attribute constructors. They provide an alternative and sometimes simpler way of constructing XML data from relational tables. Any form of XML construction in DB2 produces data of type XML and is therefore fully compatible with all other pureXML features. For example, constructed XML documents can be inserted into XML columns if they have an explicitly constructed XML document node at the top.
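
The nesting of publishing functions described in this summary can be sketched as follows. This is an illustrative query, not one of the chapter's figures. It assumes the purchaseorder table with the columns poid, orderdate, and status that appear in Figure 10.41; the element names "orders", "date", and "status" are chosen here for illustration.

```sql
-- One result row: a single <orders> element that contains
-- one <order> child per row of the purchaseorder table.
SELECT XMLELEMENT(NAME "orders",
         XMLAGG(
           XMLELEMENT(NAME "order",
             XMLATTRIBUTES(poid AS "id"),
             XMLFOREST(orderdate AS "date", status AS "status"))
           ORDER BY poid))
FROM purchaseorder;
```

XMLAGG collapses all input rows into one XML value, so the outer XMLELEMENT is evaluated once. To insert the result into an XML column rather than return it, the outermost function would additionally be wrapped in XMLDOCUMENT, as in Figure 10.41.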
CHAPTER 11
Converting XML to Relational Data

This chapter describes methods to convert XML documents to rows in relational tables. This conversion is commonly known as shredding or decomposing of XML documents. Given the rich support for XML columns in DB2 you might wonder in which cases it can still be useful or necessary to convert XML data to relational format. One common reason for shredding is that existing SQL applications might still require access to the data in relational format. For example, legacy applications, packaged business applications, or reporting software do not always understand XML and have fixed relational interfaces. Therefore you might sometimes find it useful to shred all or some of the data values of an incoming XML document into rows and columns of relational tables.

In this chapter you learn:
• The advantages and disadvantages of shredding and of different shredding methods (section 11.1)
• How to shred XML data to relational tables using INSERT statements that contain the XMLTABLE function (section 11.2)
• How to use XML Schema annotations that map and shred XML documents to relational tables (section 11.3)

11.1 ADVANTAGES AND DISADVANTAGES OF SHREDDING

The concept of XML shredding is illustrated in Figure 11.1. In this example, XML documents with customer name, address, and phone information are mapped to two relational tables. The documents can contain multiple phone elements because there is a one-to-many relationship
between customers and phones. Hence, phone numbers are shredded into a separate table. Each repeating element, such as phone, leads to an additional table in the relational target schema. Suppose the customer information can also contain multiple email addresses, multiple accounts, a list of most recent orders, multiple products per order, and other repeating items. The number of tables required in the relational target schema can increase very quickly. Shredding XML into a large number of tables can lead to a complex and unnatural fragmentation of your logical business objects that makes application development difficult and error-prone. Querying the shredded data or reassembling the original documents may require complex multiway joins.

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

    CREATE TABLE address(cid INTEGER, name VARCHAR(30),
                         street VARCHAR(40), city VARCHAR(30))

    CID   NAME              STREET           CITY
    1003  Robert Shoemaker  845 Kean Street  Aurora

    CREATE TABLE phones(cid INTEGER, phonetype VARCHAR(10),
                        phonenum VARCHAR(20))

    CID   PHONETYPE  PHONENUM
    1003  work       905-555-7258
    1003  home       416-555-2937
    1003  cell       905-555-8743

Figure 11.1 Shredding of an XML document

Depending on the complexity, variability, and purpose of your XML documents, shredding may or may not be a good option. Table 11.1 summarizes the pros and cons of shredding XML data to relational tables.

Table 11.1 When Shredding Is and Isn’t a Good Option

Shredding Can Be Useful When…
• Incoming XML data is just feeding an existing relational database.
• The XML documents do not represent logical business objects that should be preserved.
• Your primary goal is to enable existing relational applications to access XML data.
• You are happy with your relational schema and would like to use it as much as possible.
• The structure of your XML data is such that it can easily be mapped to relational tables.
• Your XML format is relatively stable and changes to it are rare.
• You rarely need to reconstruct the shredded documents.
• Querying or updating the data with SQL is more important than insert performance.

Shredding Is Not a Good Option When…
• Your XML data is complex and nested, and difficult to map to a relational schema.
• Mapping your XML format to a relational schema leads to a large number of tables.
• Your XML Schema is highly variable or tends to change over time.
• Your primary goal is to manage XML documents as intact business objects.
• You frequently need to reconstruct the shredded documents or parts of them.
• Ingesting XML data into the database at a high rate is important for your application.

In many XML application scenarios the structure and usage of the XML data do not lend themselves to easy and efficient shredding. This is the reason why DB2 supports XML columns that allow you to index and query XML data without conversion. Sometimes you will find that your application requirements can be best met with partial shredding or hybrid XML storage.

• Partial shredding means that only a subset of the elements or attributes from each incoming XML document are shredded into relational tables. This is useful if a relational application does not require all data values from each XML document. In cases where shredding each document entirely is difficult and requires a complex relational target schema, partial shredding can simplify the mapping to the relational schema significantly.
• Hybrid XML storage means that upon insert of an XML document into an XML column, selected element or attribute values are extracted and redundantly stored in relational columns.

If you choose to shred XML documents, entirely or partially, DB2 provides you with a rich set of capabilities to do some or all of the following:

• Perform custom transformations of the data values before insertion into relational columns.
• Shred the same element or attribute value into multiple columns of the same table or different tables.
• Shred multiple different elements or attributes into the same column of a table.
• Specify conditions that govern when certain elements are or are not shredded. For example, shred the address of a customer document only if the country is Canada.
• Validate XML documents with an XML Schema during shredding.
• Store the full XML document along with the shredded data.
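
As a preview of the XMLTABLE method described in section 11.2, the conditional shredding mentioned in the list above (shred the address only if the country is Canada) could be sketched as follows. This is an illustration under assumptions: it uses the address table from Figure 11.1, presumes each document has a single addr element, and places the condition in the row-generating expression, which is only one of several possible places for it.

```sql
-- Produces a row only when the customer's address is in Canada;
-- documents for other countries yield no row and insert nothing.
INSERT INTO address(cid, name, street, city)
SELECT x.custid, x.custname, x.str, x.place
FROM XMLTABLE('$i/customerinfo[addr/@country = "Canada"]'
              PASSING CAST(? AS XML) AS "i"
       COLUMNS
         custid   INTEGER     PATH '@Cid',
         custname VARCHAR(30) PATH 'name',
         str      VARCHAR(40) PATH 'addr/street',
         place    VARCHAR(30) PATH 'addr/city') AS x;
```

Putting the predicate into the row-generating expression keeps the filter next to the navigation path. The same effect could be achieved with a WHERE clause on an additional country column.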
DB2 9 for z/OS and DB2 9.x for Linux, UNIX, and Windows support two shredding methods:

• SQL INSERT statements that use the XMLTABLE function. This function navigates into an input document and produces one or multiple relational rows for insert into a relational table.
• Decomposition with an annotated XML Schema. Since an XML Schema defines the structure of XML documents, annotations can be added to the schema to define how elements and attributes are mapped to relational tables.

Table 11.2 and Table 11.3 discuss the advantages and disadvantages of the XMLTABLE method and the annotated schema method.

Table 11.2 Considerations for the XMLTABLE Method

Advantages of the XMLTABLE Method
• It allows you to shred data even if you do not have an XML Schema.
• It does not require you to understand the XML Schema language or to understand schema annotations for decomposition.
• It is generally easier to use than annotated schemas because it is based on SQL and XPath.
• You can use familiar XPath, XQuery, or SQL functions and expressions to extract and optionally modify the data values.
• It often requires no or little work during XML Schema evolution.
• The shredding process can consume data from multiple XML and relational sources, if needed, such as values from DB2 sequences or look-up data from other relational tables.
• It can often provide better performance than annotated schema decompositions.

Disadvantages of the XMLTABLE Method
• For each target table that you want to shred into you need one INSERT statement.
• You might have to combine multiple INSERT statements in a stored procedure.
• There is no GUI support for implementing the INSERT statements and the required XMLTABLE functions. You need to be familiar with XPath and SQL/XML.
Table 11.3 Considerations for Annotated Schema Decomposition

Advantages of the Annotated Schema Method
• The mapping from XML to relational tables can be defined using a GUI in IBM Data Studio Developer.
• If you shred complex XML data into a large number of tables, the coding effort can be lower than with the XMLTABLE approach.
• It offers a bulk mode with detailed diagnostics if some documents fail to shred.

Disadvantages of the Annotated Schema Method
• It does not allow shredding without an XML Schema.
• You might have to manually copy annotations when you start using a new version of your XML Schema.
• Despite the GUI support, you need to be familiar with the XML Schema language for all but simple shredding scenarios.
• Annotating an XML Schema can be complex, if the schema itself is complex.

11.2 SHREDDING WITH THE XMLTABLE FUNCTION

The XMLTABLE function is an SQL table function that uses XQuery expressions to create relational rows from an XML input document. For details on the XMLTABLE function, see Chapter 7, Querying XML Data with SQL/XML. In this section we describe how to use the XMLTABLE function in an SQL INSERT statement to perform shredding. We use the shredding scenario in Figure 11.1 as an example.

The first step is to create the relational target tables, if they don’t already exist. For the scenario in Figure 11.1 the target tables are defined as follows:

    CREATE TABLE address(cid INTEGER, name VARCHAR(30),
                         street VARCHAR(40), city VARCHAR(30))

    CREATE TABLE phones(cid INTEGER, phonetype VARCHAR(10),
                        phonenum VARCHAR(20))

Based on the definition of the target tables you construct the INSERT statements that shred incoming XML documents. The INSERT statements have to be of the form INSERT INTO … SELECT … FROM … XMLTABLE, as shown in Figure 11.2. Each XMLTABLE function contains a parameter marker (“?”) through which an application can pass the XML document that is to be shredded.
SQL typing rules require the parameter marker to be cast to the appropriate data type. The SELECT clause selects columns produced by the XMLTABLE function for insert into the address and phones tables, respectively.
    INSERT INTO address(cid, name, street, city)
    SELECT x.custid, x.custname, x.str, x.place
    FROM XMLTABLE('$i/customerinfo' PASSING CAST(? AS XML) AS "i"
           COLUMNS
             custid   INTEGER     PATH '@Cid',
             custname VARCHAR(30) PATH 'name',
             str      VARCHAR(40) PATH 'addr/street',
             place    VARCHAR(30) PATH 'addr/city' ) AS x ;

    INSERT INTO phones(cid, phonetype, phonenum)
    SELECT x.custid, x.ptype, x.number
    FROM XMLTABLE('$i/customerinfo/phone'
                  PASSING CAST(? AS XML) AS "i"
           COLUMNS
             custid INTEGER     PATH '../@Cid',
             number VARCHAR(15) PATH '.',
             ptype  VARCHAR(10) PATH './@type') AS x ;

Figure 11.2 Inserting XML element and attribute values into relational columns

To populate the two target tables as illustrated in Figure 11.1, both INSERT statements have to be executed with the same XML document as input. One approach is that the application issues both INSERT statements in one transaction and binds the same XML document to the parameter markers for both statements. This approach works well but can be optimized, because the same XML document is sent from the client to the server and parsed at the DB2 server twice, once for each INSERT statement. This overhead can be avoided by combining both INSERT statements in a single stored procedure. The application then only makes a single stored procedure call and passes the input document once, regardless of the number of INSERT statements in the stored procedure. Chapter 18, Using XML in Stored Procedures, UDFs, and Triggers, demonstrates such a stored procedure as well as other examples of manipulating XML data in stored procedures and user-defined functions.

Alternatively, the INSERT statements in Figure 11.2 can read a set of input documents from an XML column. Suppose the documents have been loaded into the XML column info of the customer table.
Then you need to modify one line in each of the INSERT statements in Figure 11.2 to read the input document from the customer table:

    FROM customer, XMLTABLE('$i/customerinfo' PASSING info AS "i"

Loading the input documents into a staging table can be advantageous if you have to shred many documents. The LOAD utility parallelizes the parsing of XML documents, which reduces the time to move the documents into the database. When the documents are stored in an XML column in parsed format, the XMLTABLE function can shred the documents without XML parsing.

The INSERT statements can be enriched with XQuery or SQL functions or joins to tailor the shredding process to specific requirements. Figure 11.3 provides an example. The SELECT clause contains the function RTRIM to remove trailing blanks from the column x.ptype. The row-generating expression of the XMLTABLE function contains a predicate that excludes home phone numbers from being shredded into the target table. The column-generating expression for the phone numbers uses the XQuery function normalize-space, which strips leading and trailing whitespace and replaces each internal sequence of whitespace characters with a single blank character. The statement also performs a join to the lookup table areacodes so that a phone number is inserted into the phones table only if its area code is listed in the areacodes table.

    INSERT INTO phones(cid, phonetype, phonenum)
    SELECT x.custid, RTRIM(x.ptype), x.number
    FROM areacodes a,
         XMLTABLE('$i/customerinfo/phone[@type != "home"]'
                  PASSING CAST(? AS XML) AS "i"
           COLUMNS
             custid INTEGER     PATH '../@Cid',
             number VARCHAR(15) PATH 'normalize-space(.)',
             ptype  VARCHAR(10) PATH './@type') AS x
    WHERE SUBSTR(x.number,1,3) = a.code;

Figure 11.3 Using functions and joins to customize the shredding

11.2.1 Hybrid XML Storage

In many situations the complexity of the XML document structures makes shredding difficult, inefficient, and undesirable. Besides the performance penalty of shredding, scattering the values of an XML document across a large number of tables can make it difficult for an application developer to understand and query the data. To improve XML insert performance and to reduce the number of tables in your database, you may want to store XML documents in a hybrid manner. This approach extracts the values of selected XML elements or attributes and stores them in relational columns alongside the full XML document.

The example in the previous section used two tables, address and phones, as the target tables for shredding the customer documents.
You might prefer to use just a single table that contains the customer cid, name, and city values in relational columns and the full XML document with the repeating phone elements and other information in an XML column. You can define the following table:

    CREATE TABLE hybrid(cid INTEGER NOT NULL PRIMARY KEY,
                        name VARCHAR(30), city VARCHAR(25), info XML)

Figure 11.4 shows the INSERT statement to populate this table. The XMLTABLE function takes an XML document as input via a parameter marker. The column definitions in the XMLTABLE function produce four columns that match the definition of the target table hybrid. The row-generating expression in the XMLTABLE function is just $i, which produces the full input document. This expression is the input for the column-generating expressions in the COLUMNS clause of the XMLTABLE function. In particular, the column expression '.' returns the full input document as-is and produces the XML column doc for insert into the info column of the target table.

    INSERT INTO hybrid(cid, name, city, info)
    SELECT x.custid, x.custname, x.city, x.doc
    FROM XMLTABLE('$i' PASSING CAST(? AS XML) AS "i"
           COLUMNS
             custid   INTEGER     PATH 'customerinfo/@Cid',
             custname VARCHAR(30) PATH 'customerinfo/name',
             city     VARCHAR(25) PATH 'customerinfo/addr/city',
             doc      XML         PATH '.' ) AS x;

Figure 11.4 Storing an XML document in a hybrid fashion

It is currently not possible to define check constraints in DB2 to enforce the integrity between relational columns and values in an XML document in the same row. You can, however, define INSERT and UPDATE triggers on the table to populate the relational columns automatically whenever a document is inserted or updated. Triggers are discussed in Chapter 18, Using XML in Stored Procedures, UDFs, and Triggers.

It can be useful to test such INSERT statements in the DB2 Command Line Processor (CLP). For this purpose you can replace the parameter marker with a literal XML document as shown in Figure 11.5. The literal document is a string that must be enclosed in single quotes and converted to the data type XML with the XMLPARSE function. Alternatively, you can read the input document from the file system with one of the UDFs that were introduced in Chapter 4, Inserting and Retrieving XML Data. The use of a UDF is demonstrated in Figure 11.6.
    INSERT INTO hybrid(cid, name, city, info)
    SELECT x.custid, x.custname, x.city, x.doc
    FROM XMLTABLE('$i' PASSING
           XMLPARSE(document
             '<customerinfo Cid="1001">
                <name>Kathy Smith</name>
                <addr country="Canada">
                  <street>25 EastCreek</street>
                  <city>Markham</city>
                  <prov-state>Ontario</prov-state>
                  <pcode-zip>N9C 3T6</pcode-zip>
                </addr>
                <phone type="work">905-555-7258</phone>
              </customerinfo>') AS "i"
           COLUMNS
             custid   INTEGER     PATH 'customerinfo/@Cid',
             custname VARCHAR(30) PATH 'customerinfo/name',
             city     VARCHAR(25) PATH 'customerinfo/addr/city',
             doc      XML         PATH '.' ) AS x;

Figure 11.5 Hybrid insert statement with a literal XML document
    INSERT INTO hybrid(cid, name, city, info)
    SELECT x.custid, x.custname, x.city, x.doc
    FROM XMLTABLE('$i' PASSING
           XMLPARSE(document
             blobFromFile('/xml/mydata/cust0037.xml')) AS "i"
           COLUMNS
             custid   INTEGER     PATH 'customerinfo/@Cid',
             custname VARCHAR(30) PATH 'customerinfo/name',
             city     VARCHAR(25) PATH 'customerinfo/addr/city',
             doc      XML         PATH '.' ) AS x;

Figure 11.6 Hybrid insert statement with a “FromFile” UDF

The insert logic in Figure 11.4, Figure 11.5, and Figure 11.6 is identical. The only difference is how the input document is provided: via a parameter marker, as a literal string that is enclosed in single quotes, or via a UDF that reads a document from the file system.

11.2.2 Relational Views over XML Data

You can create relational views over XML data using XMLTABLE expressions. This allows you to provide applications with a relational or hybrid view of the XML data without actually storing the data in a relational or hybrid format. This can be useful if you want to avoid the overhead of converting large amounts of XML data to relational format. The basic SELECT … FROM … XMLTABLE constructs that were used in the INSERT statements in the previous section can also be used in CREATE VIEW statements.

As an example, suppose you want to create a relational view over the elements of the XML documents in the customer table to expose the customer identifier, name, street, and city values. Figure 11.7 shows the corresponding view definition plus an SQL query against the view.

    CREATE VIEW custview(id, name, street, city) AS
      SELECT x.custid, x.custname, x.str, x.place
      FROM customer,
           XMLTABLE('$i/customerinfo' PASSING info AS "i"
             COLUMNS
               custid   INTEGER     PATH '@Cid',
               custname VARCHAR(30) PATH 'name',
               str      VARCHAR(40) PATH 'addr/street',
               place    VARCHAR(30) PATH 'addr/city' ) AS x;

    SELECT id, name FROM custview WHERE city = 'Aurora';

    ID          NAME
    ----------- ------------------------------
           1003 Robert Shoemaker

      1 record(s) selected.
Figure 11.7 Creating a view over XML data
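
The same technique extends to repeating elements. The following sketch, which is not one of the chapter's figures, defines a view that exposes one row per phone element, mirroring the phones table of Figure 11.1 without storing any shredded data. The view name phoneview and the column names inside XMLTABLE are hypothetical; the customer table and its info column are assumed as in Figure 11.7.

```sql
-- One view row per <phone> element in each customer document.
CREATE VIEW phoneview(id, phonetype, phonenum) AS
  SELECT x.custid, x.ptype, x.num
  FROM customer,
       XMLTABLE('$i/customerinfo/phone' PASSING info AS "i"
         COLUMNS
           custid INTEGER     PATH '../@Cid',
           ptype  VARCHAR(10) PATH '@type',
           num    VARCHAR(20) PATH '.') AS x;

-- Relational applications can then query the phones with plain SQL.
SELECT id, phonenum FROM phoneview WHERE phonetype = 'work';
```

Because the row-generating expression iterates over the phone elements, a customer with three phones contributes three rows to the view, and a customer without phones contributes none.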
The query over the view in Figure 11.7 contains an SQL predicate for the city column in the view. The values in the city column come from an XML element in the underlying XML column. You can speed up this query by creating an XML index on /customerinfo/addr/city for the info column of the customer table. DB2 9 for z/OS and DB2 9.7 for Linux, UNIX, and Windows are able to convert the relational predicate city = 'Aurora' into an XML predicate on the underlying XML column so that the XML index can be used. This is not possible in DB2 9.1 and DB2 9.5 for Linux, UNIX, and Windows. In these previous versions of DB2, include the XML column in the view definition and write the search condition as an XML predicate, as in the following query. Otherwise an XML index cannot be used.

    SELECT id, name FROM custview
    WHERE XMLEXISTS('$INFO/customerinfo/addr[city = "Aurora"]')

11.3 SHREDDING WITH ANNOTATED XML SCHEMAS

This section describes another approach to shredding XML documents into relational tables. The approach is called annotated schema shredding or annotated schema decomposition because it is based on annotations in an XML Schema. These annotations define how XML elements and attributes in your XML data map to columns in your relational tables.

To perform annotated schema shredding, take the following steps:

• Identify or create the relational target tables that will hold the shredded data.
• Annotate your XML Schema to define the mapping from XML to the relational tables.
• Register the XML Schema in the DB2 XML Schema Repository.
• Shred XML documents with Command Line Processor commands or built-in stored procedures.

Assuming you have defined the relational tables that you want to shred into, let’s look at annotating an XML Schema.

11.3.1 Annotating an XML Schema

Schema annotations are additional elements and attributes in an XML Schema to provide mapping information.
DB2 can use this information to shred XML documents to relational tables. The annotations do not change the semantics of the original XML Schema. If a document is valid for the annotated schema then it is also valid for the original schema, and vice versa. You can use an annotated schema to validate XML documents just like the original XML Schema. For an introduction to XML Schemas, see Chapter 16, Managing XML Schemas. The following is one line from an XML Schema: <xs:element name="street" type="xs:string" minOccurs="1"/>
This line defines an XML element called street and declares that its data type is xs:string and that this element has to occur at least once. You can add a simple annotation to this element definition to indicate that the element should be shredded into the column STREET of the table ADDRESS. The annotation consists of two additional attributes in the element definition, as follows:

    <xs:element name="street" type="xs:string" minOccurs="1"
                db2-xdb:rowSet="ADDRESS" db2-xdb:column="STREET"/>

The same annotation can also be provided as schema elements instead of attributes, as shown next. You will see later in Figure 11.8 why this can be useful.

    <xs:element name="street" type="xs:string" minOccurs="1">
      <xs:annotation>
        <xs:appinfo>
          <db2-xdb:rowSetMapping>
            <db2-xdb:rowSet>ADDRESS</db2-xdb:rowSet>
            <db2-xdb:column>STREET</db2-xdb:column>
          </db2-xdb:rowSetMapping>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>

The prefix xs is used for all constructs that belong to the XML Schema language, and the prefix db2-xdb is used for all DB2-specific schema annotations. This provides a clear distinction and ensures that the annotated schema validates the same XML documents as the original schema.

There are 14 different types of annotations. They allow you to specify what to shred, where to shred to, how to filter or transform the shredded data, and in which order to execute inserts into the target tables. Table 11.4 provides an overview of the available annotations, broken down into logical groupings by user task. The individual annotations are further described in Table 11.5.
Table 11.4 Overview and Grouping of Schema Annotations

If You Want to…
    Use This Annotation

Specify the target tables to shred into
    db2-xdb:rowSet, db2-xdb:column, db2-xdb:SQLSchema, db2-xdb:defaultSQLSchema

Specify what to shred
    db2-xdb:contentHandling

Transform data values while shredding
    db2-xdb:expression, db2-xdb:normalization, db2-xdb:truncate

Filter data
    db2-xdb:condition, db2-xdb:locationPath

Map an element or attribute to multiple columns
    db2-xdb:rowSetMapping

Map several elements or attributes to the same column
    db2-xdb:table

Define the order in which rows are inserted into the target table, to avoid referential integrity violations
    db2-xdb:rowSetOperationOrder, db2-xdb:order

Table 11.5 XML Schema Annotations

db2-xdb:defaultSQLSchema
    The default relational schema for the target tables.

db2-xdb:SQLSchema
    Overrides the default schema for individual tables.

db2-xdb:rowSet
    The table name that the element or attribute is mapped to.

db2-xdb:column
    The column name that the element or attribute is mapped to.

db2-xdb:contentHandling
    For an XML element, this annotation defines how to derive the value that will be inserted into the target column. You can choose the text value of just this element (text), the concatenation of this element’s text and the text of all its descendant nodes (stringValue), or the serialized XML (including all tags) of this element and all descendants (serializeSubtree). If you omit this annotation, DB2 chooses an appropriate default based on the nature of the respective element.

db2-xdb:truncate
    Specifies whether a value should be truncated if its length is greater than the length of the target column.

db2-xdb:normalization
    Specifies how to treat whitespace. Valid values are whitespaceStrip, canonical, and original.

db2-xdb:expression
    Specifies an expression that is to be applied to the data before insertion into the target table.

db2-xdb:locationPath
    Filters based on the XML context. For example, if it is a customer address then shred to the cust table; if it is an employee address then shred to the employee table.

db2-xdb:condition
    Specifies value conditions so that data is inserted into a target table only if all conditions are true.

db2-xdb:rowSetMapping
    Enables users to specify multiple mappings, to the same or different tables, for an element or attribute.

db2-xdb:table
    Maps multiple elements or attributes to a single column.

db2-xdb:order
    Specifies the insertion order of rows among multiple tables.

db2-xdb:rowSetOperationOrder
    Groups together multiple db2-xdb:order annotations.

To demonstrate annotated schema decomposition we use the shredding scenario in Figure 11.1 as an example. Assume that the target tables have been defined as shown in Figure 11.1. An annotated schema that defines the desired mapping is provided in Figure 11.8. Let’s look at the lines that are highlighted in bold font. The first bold line declares the namespace prefix db2-xdb, which is used throughout the schema to distinguish DB2-specific annotations from regular XML Schema tags. The first use of this prefix is in the annotation db2-xdb:defaultSQLSchema, which defines the relational schema of the target tables. The next annotation occurs in the definition of the element name. The two annotation attributes db2-xdb:rowSet="ADDRESS" and db2-xdb:column="NAME" define the target table and column for the name element. Similarly, the street and city elements are also mapped to respective columns of the ADDRESS table. The next two annotations map the phone number and the type attribute to columns in the PHONES table. The last block of annotations belongs to the XML Schema definition of the Cid attribute.
Since the Cid attribute value becomes the join key between the ADDRESS and the PHONES table, it has to be mapped to both tables. Two row set mappings are necessary, which requires the use of annotation elements instead of annotation attributes. The first db2-xdb:rowSetMapping maps the Cid attribute to the CID column in the ADDRESS table. The second db2-xdb:rowSetMapping assigns the Cid attribute to the CID column in the PHONES table.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified"
               xmlns:db2-xdb="http://www.ibm.com/xmlns/prod/db2/xdb1" >
      <xs:annotation>
        <xs:appinfo>
          <db2-xdb:defaultSQLSchema>db2admin</db2-xdb:defaultSQLSchema>
        </xs:appinfo>
      </xs:annotation>
      <xs:element name="customerinfo">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string" minOccurs="1"
                        db2-xdb:rowSet="ADDRESS" db2-xdb:column="NAME"/>
            <xs:element name="addr" minOccurs="1" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="street" type="xs:string" minOccurs="1"
                              db2-xdb:rowSet="ADDRESS" db2-xdb:column="STREET"/>
                  <xs:element name="city" type="xs:string" minOccurs="1"
                              db2-xdb:rowSet="ADDRESS" db2-xdb:column="CITY"/>
                  <xs:element name="prov-state" type="xs:string" minOccurs="1" />
                  <xs:element name="pcode-zip" type="xs:string" minOccurs="1" />
                </xs:sequence>
                <xs:attribute name="country" type="xs:string" />
              </xs:complexType>
            </xs:element>
            <xs:element name="phone" minOccurs="0" maxOccurs="unbounded"
                        db2-xdb:rowSet="PHONES" db2-xdb:column="PHONENUM">
              <xs:complexType>
                <xs:simpleContent>
                  <xs:extension base="xs:string">
                    <xs:attribute name="type" form="unqualified" type="xs:string"
                                  db2-xdb:rowSet="PHONES" db2-xdb:column="PHONETYPE"/>
                  </xs:extension>
                </xs:simpleContent>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="Cid" type="xs:integer">
            <xs:annotation>
              <xs:appinfo>
                <db2-xdb:rowSetMapping>
                  <db2-xdb:rowSet>ADDRESS</db2-xdb:rowSet>
                  <db2-xdb:column>CID</db2-xdb:column>
                </db2-xdb:rowSetMapping>
                <db2-xdb:rowSetMapping>
                  <db2-xdb:rowSet>PHONES</db2-xdb:rowSet>
                  <db2-xdb:column>CID</db2-xdb:column>
                </db2-xdb:rowSetMapping>
              </xs:appinfo>
            </xs:annotation>
          </xs:attribute>
        </xs:complexType>
      </xs:element>
    </xs:schema>

Figure 11.8 Annotated schema to implement the shredding in Figure 11.1
11.3.2 Defining Schema Annotations Visually in IBM Data Studio

You can add annotations to an XML Schema manually, using any text editor or XML Schema editor. Alternatively, you can use the Annotated XSD Mapping Editor in IBM Data Studio Developer. To invoke the editor, right-click on an XML Schema name and select Open With, Annotated XSD Mapping Editor. A screenshot of the mapping editor is shown in Figure 11.9. The left side of the editor shows the hierarchical document structure defined by the XML Schema (Source). The right side shows the tables and columns of the relational target schema (Target). You can add mapping relationships by connecting source items with target columns. There is also a discover function to find probable relationships. Mapped relationships are represented in the mapping editor by lines drawn between source elements and target columns.

Figure 11.9 Annotated XSD Mapping Editor in Data Studio Developer

11.3.3 Registering an Annotated Schema

After you have created your annotated XML Schema you need to register it in the XML Schema Repository of the database. DB2’s XML Schema Repository is described in detail in Chapter 16, Managing XML Schemas. For the annotated schema in Figure 11.8 it is sufficient to issue the REGISTER XMLSCHEMA command with its COMPLETE and ENABLE DECOMPOSITION options as shown in Figure 11.10. In this example the XML Schema is assumed to reside in the file /xml/myschemas/cust2.xsd. Upon registration it is assigned the SQL identifier db2admin.cust2xsd. This identifier can be used to reference the schema later. The COMPLETE option of the command indicates that there are no additional XML Schema documents to be added. The option ENABLE DECOMPOSITION indicates that this XML Schema can be used not only for document validation but also for shredding.
REGISTER XMLSCHEMA 'http://pureXMLcookbook.org'
  FROM '/xml/myschemas/cust2.xsd'
  AS db2admin.cust2xsd COMPLETE ENABLE DECOMPOSITION;

Figure 11.10 Registering an annotated XML schema

Figure 11.11 shows that you can query the DB2 catalog view syscat.xsrobjects to determine whether a registered schema is enabled for decomposition (Y) or not (N).

SELECT SUBSTR(objectname,1,10) AS objectname,
       status, decomposition
FROM syscat.xsrobjects ;

OBJECTNAME STATUS DECOMPOSITION
---------- ------ -------------
CUST2XSD   C      Y

Figure 11.11 Checking the status of an annotated XML schema

The DECOMPOSITION status of an annotated schema is automatically changed to X (inoperative) and shredding is disabled if any of the target tables are dropped or a target column is altered. No warning is issued when that happens, and subsequent attempts to use the schema for shredding fail. You can also use the following commands to disable and enable an annotated schema for shredding:

ALTER XSROBJECT cust2xsd DISABLE DECOMPOSITION;
ALTER XSROBJECT cust2xsd ENABLE DECOMPOSITION;

11.3.4 Decomposing One XML Document at a Time

After you have registered and enabled the annotated XML Schema you can decompose XML documents with the DECOMPOSE XML DOCUMENT command or with a built-in stored procedure. The DECOMPOSE XML DOCUMENT command is convenient to use in the DB2 Command Line Processor (CLP) while the stored procedure can be called from an application program or the CLP. The CLP command takes two parameters as input: the filename of the XML document that is to be shredded and the SQL identifier of the annotated schema, as in the following example:

DECOMPOSE XML DOCUMENT /xml/mydocuments/cust01.xml
  XMLSCHEMA db2admin.cust2xsd VALIDATE;

The keyword VALIDATE is optional and indicates whether XML documents should be validated against the schema as part of the shredding process.
While shredding, DB2 traverses both the XML document and the annotated schema and detects fundamental schema violations even if the VALIDATE keyword is not specified. For example, the shredding process fails with an error if a mandatory element is missing, even if this element is not being shredded and the VALIDATE keyword is omitted. Similarly, extraneous elements or data type violations also cause the decomposition to fail. The reason is that the shredding process walks through the annotated XML Schema and the instance document in lockstep and therefore detects many schema violations “for free” even if the XML parser does not perform validation.

To decompose XML documents from an application program, use the stored procedure XDBDECOMPXML. The parameters of this stored procedure are shown in Figure 11.12 and described in Table 11.6.

>>-XDBDECOMPXML--(--rschema--,--xmlschemaname--,--xmldoc--,---->
>--documentid--,--validation--,--reserved--,--reserved--,------>
>--reserved--)------------------------------------------------><

Figure 11.12 Syntax and parameters of the stored procedure XDBDECOMPXML

Table 11.6 Description of the Parameters of the Stored Procedure XDBDECOMPXML

Parameter       Description
rschema         The relational schema part of the two-part SQL identifier of the annotated XML Schema. For example, if the SQL identifier of the XML Schema is db2admin.cust2xsd, then you should pass the string 'db2admin' to this parameter. In DB2 for z/OS this value must be either 'SYSXSR' or NULL.
xmlschemaname   The second part of the two-part SQL identifier of the annotated XML Schema. If the SQL identifier of the XML Schema is db2admin.cust2xsd, then you pass the string 'cust2xsd' to this parameter. This value cannot be NULL.
xmldoc          In DB2 for Linux, UNIX, and Windows, this parameter is of type BLOB(1M) and takes the XML document to be decomposed. In DB2 for z/OS this parameter is of type CLOB AS LOCATOR. This parameter cannot be NULL.
documentid      A string that the caller can use to identify the input XML document. The value provided will be substituted for any use of $DECOMP_DOCUMENTID specified in the db2-xdb:expression or db2-xdb:condition annotations.
validation      Possible values are: 0 (no validation) and 1 (validation is performed). This parameter does not exist in DB2 for z/OS.
reserved        Parameters reserved for future use. The values passed for these arguments must be NULL. These parameters do not exist in DB2 for z/OS.
A Java code snippet that calls the stored procedure using parameter markers is shown in Figure 11.13.

// Assumes an open java.sql.Connection named con; requires java.io.File,
// java.io.FileInputStream, and java.sql.CallableStatement.
CallableStatement callStmt = con.prepareCall(
    "call SYSPROC.XDBDECOMPXML(?,?,?,?,?, null, null, null)");
// the backslash must be escaped inside a Java string literal:
File xmldoc = new File("c:\\mydoc.xml");
FileInputStream xmldocis = new FileInputStream(xmldoc);
callStmt.setString(1, "db2admin" );
callStmt.setString(2, "cust2xsd" );
// document to be shredded:
callStmt.setBinaryStream(3, xmldocis, (int)xmldoc.length() );
callStmt.setString(4, "mydocument26580" );
// no schema validation in this call:
callStmt.setInt(5, 0);
callStmt.execute();

Figure 11.13 Java code that invokes the stored procedure XDBDECOMPXML

While the input parameter for XML documents is of type CLOB AS LOCATOR in DB2 for z/OS, it is of type BLOB(1M) in DB2 for Linux, UNIX, and Windows. If you expect your XML documents to be larger than 1MB, use one of the stored procedures listed in Table 11.7. These stored procedures are all identical except for their name and the size of the input parameter xmldoc. When you call a stored procedure, DB2 allocates memory according to the declared size of the input parameters. For example, if all of your input documents are at most 10MB in size, the stored procedure XDBDECOMPXML10MB is a good choice to conserve memory.

Table 11.7 Stored Procedures for Different Document Sizes (DB2 for Linux, UNIX, and Windows)

Stored Procedure     Document Size   Supported since
XDBDECOMPXML         ≤1MB            DB2 9.1
XDBDECOMPXML10MB     ≤10MB           DB2 9.1
XDBDECOMPXML25MB     ≤25MB           DB2 9.1
XDBDECOMPXML50MB     ≤50MB           DB2 9.1
XDBDECOMPXML75MB     ≤75MB           DB2 9.1
XDBDECOMPXML100MB    ≤100MB          DB2 9.1
XDBDECOMPXML500MB    ≤500MB          DB2 9.5 FP3
XDBDECOMPXML1GB      ≤1GB            DB2 9.5 FP3
XDBDECOMPXML1_5GB    ≤1.5GB          DB2 9.7
XDBDECOMPXML2GB      ≤2GB            DB2 9.7

For platform compatibility, DB2 for z/OS supports the procedure XDBDECOMPXML100MB with the same parameters as DB2 for Linux, UNIX, and Windows, including the parameter for validation.

11.3.5 Decomposing XML Documents in Bulk

DB2 9.7 for Linux, UNIX, and Windows introduces a new stored procedure called XDB_DECOMP_XML_FROM_QUERY. It uses an annotated schema to decompose one or multiple XML documents selected from a column of type XML, BLOB, or VARCHAR FOR BIT DATA. The main difference from the procedure XDBDECOMPXML is that XDB_DECOMP_XML_FROM_QUERY takes an SQL query as a parameter and executes it to obtain the input documents from a DB2 table. For a large number of documents, a LOAD operation followed by a “bulk decomp” can be more efficient than shredding these documents with a separate stored procedure call for each document.

Figure 11.14 shows the parameters of this stored procedure. The parameters commit_count and allow_access are similar to the corresponding parameters of DB2’s IMPORT utility. The parameters total_docs, num_docs_decomposed, and result_report are output parameters that provide information about the outcome of the bulk shredding process. All parameters are explained in Table 11.8.

>>--XDB_DECOMP_XML_FROM_QUERY--(--rschema--,--xmlschema--,-->
>--query--,--validation--,--commit_count--,--allow_access--,---->
>--reserved--,--reserved2--,--continue_on_error--,-------------->
>--total_docs--,--num_docs_decomposed--,--result_report--)--><

Figure 11.14 The stored procedure XDB_DECOMP_XML_FROM_QUERY
Table 11.8 Parameters for XDB_DECOMP_XML_FROM_QUERY

Parameter             Description
rschema               Same as for XDBDECOMPXML.
xmlschema             Same as xmlschemaname for XDBDECOMPXML.
query                 A query string of type CLOB(1GB), which cannot be NULL. The query must be an SQL or SQL/XML SELECT statement and must return two columns. The first column must contain a unique document identifier for each XML document in the second column of the result set. The second column contains the XML documents to be shredded and must be of type XML, BLOB, VARCHAR FOR BIT DATA, or LONG VARCHAR FOR BIT DATA.
validation            Possible values are: 0 (no validation) and 1 (validation is performed).
commit_count          An integer value equal to or greater than 0. A value of 0 means the stored procedure does not perform any commits. A value of n means that a commit is performed after every n successful document decompositions.
allow_access          A value of 1 or 0. If the value is 0, then the stored procedure acquires an exclusive lock on all tables that are referenced in the annotated XML Schema. If the value is 1, then the stored procedure acquires a shared lock.
reserved, reserved2   These parameters are reserved for future use and must be NULL.
continue_on_error     Can be 1 or 0. A value of 0 means the procedure stops upon the first document that cannot be decomposed; for example, if the document does not match the XML Schema.
total_docs            An output parameter that indicates the total number of documents that the procedure tried to decompose.
num_docs_decomposed   An output parameter that indicates the number of documents that were successfully decomposed.
result_report         An output parameter of type BLOB(2GB). It contains an XML document that provides diagnostic information for each document that was not successfully decomposed. This report is not generated if all documents shredded successfully. The reason this is a BLOB field (rather than CLOB) is to avoid codepage conversion and potential truncation/data loss if the application code page is materially different from the database codepage.

Figure 11.15 shows an invocation of the XDB_DECOMP_XML_FROM_QUERY stored procedure in the CLP. This stored procedure call reads all XML documents from the info column of the customer table and shreds them with the annotated XML Schema db2admin.cust2xsd. The procedure commits every 25 documents and does not stop if a document cannot be shredded.
call SYSPROC.XDB_DECOMP_XML_FROM_QUERY
  ('DB2ADMIN', 'CUST2XSD', 'SELECT cid, info FROM customer',
   0, 25, 1, NULL, NULL, '1',?,?,?) ;

  Value of output parameters
  --------------------------
  Parameter Name  : TOTALDOCS
  Parameter Value : 100

  Parameter Name  : NUMDOCSDECOMPOSED
  Parameter Value : 100

  Parameter Name  : RESULTREPORT
  Parameter Value : x''

  Return Status = 0

Figure 11.15 Calling the procedure SYSPROC.XDB_DECOMP_XML_FROM_QUERY

If you frequently perform bulk shredding in the CLP, use the command DECOMPOSE XML DOCUMENTS instead of the stored procedure. It is more convenient for command-line use and performs the same job as the stored procedure XDB_DECOMP_XML_FROM_QUERY. Figure 11.16 shows the syntax of the command. The various clauses and keywords of the command have the same meaning as the corresponding stored procedure parameters. For example, query is the SELECT statement that provides the input documents, and xml-schema-name is the two-part SQL identifier of the annotated XML Schema.

>>-DECOMPOSE XML DOCUMENTS IN----'query'----XMLSCHEMA------->

                                  .-ALLOW NO ACCESS-.
>--xml-schema-name--+----------+--+-----------------+----------->
                    '-VALIDATE-'  '-ALLOW ACCESS----'

>--+----------------------+--+-------------------+-------------->
   '-COMMITCOUNT--integer-'  '-CONTINUE_ON_ERROR-'

>--+--------------------------+--------------------------------><
   '-MESSAGES--message-file-'

Figure 11.16 Syntax for the DECOMPOSE XML DOCUMENTS command

Figure 11.17 illustrates the execution of the DECOMPOSE XML DOCUMENTS command in the DB2 Command Line Processor.

DECOMPOSE XML DOCUMENTS IN 'SELECT cid, info FROM customer'
  XMLSCHEMA db2admin.cust2xsd MESSAGES decomp_errors.xml ;

DB216001I The DECOMPOSE XML DOCUMENTS command successfully
decomposed all "100" documents.

Figure 11.17 Example of the DECOMPOSE XML DOCUMENTS command
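When bulk decomposition is driven from an application rather than the CLP, the error report (the result_report output parameter, in the format of the sample in Figure 11.18) has to be interpreted by application code. The following is a minimal sketch of one way to do that in Java, assuming the report bytes have already been read from the BLOB. The class name DecompErrorReport and the method parse are hypothetical names chosen for illustration; only the namespace URI and the element names documentId and errorMsg are taken from the sample report.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of DB2): reads a bulk-decomposition error report. */
final class DecompErrorReport {

    // Namespace bound to the xdb: prefix in the sample error report (Figure 11.18).
    static final String XDB_NS = "http://www.ibm.com/xmlns/prod/db2/xdb1";

    /** Returns documentId -> error message, in report order; empty if there is no report. */
    static Map<String, String> parse(byte[] report) {
        Map<String, String> errors = new LinkedHashMap<>();
        if (report == null || report.length == 0) {
            return errors; // no report is generated when all documents shred successfully
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Parsing from bytes lets the XML parser determine the encoding itself.
            Document doc = builder.parse(new ByteArrayInputStream(report));
            NodeList failed = doc.getElementsByTagNameNS(XDB_NS, "document");
            for (int i = 0; i < failed.getLength(); i++) {
                Element entry = (Element) failed.item(i);
                errors.put(text(entry, "documentId"), text(entry, "errorMsg"));
            }
            return errors;
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot parse decomposition error report", ex);
        }
    }

    // Trimmed text of the first child element with the given local name, or "" if absent.
    private static String text(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS(XDB_NS, localName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Handing the parser raw bytes rather than a decoded string is in line with the reason the report is a BLOB: no code page conversion is applied before the XML parser sees the data.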
If you don’t specify a message-file then the error report is written to standard output. Figure 11.18 shows a sample error report. For each document that failed to shred, the error report shows the document identifier (xdb:documentId). This identifier is obtained from the first column that is produced by the SQL statement in the DECOMPOSE XML DOCUMENTS command. The error report also contains the DB2 error message for each document that failed. Figure 11.18 reveals that document 1002 contains an unexpected XML attribute called status, and that document 1005 contains an element or attribute value abc that is invalid because the XML Schema expected to find a value of type xs:integer. If you need more detailed information on why a document is not valid for a given XML Schema, use the stored procedure XSR_GET_PARSING_DIAGNOSTICS, which we discuss in section 17.6, Diagnosing Validation and Parsing Errors.

<?xml version='1.0' ?>
<xdb:errorReport xmlns:xdb="http://www.ibm.com/xmlns/prod/db2/xdb1">
  <xdb:document>
    <xdb:documentId>1002</xdb:documentId>
    <xdb:errorMsg>SQL16271N Unknown attribute "status" at or near
      line "1" in document "1002".</xdb:errorMsg>
  </xdb:document>
  <xdb:document>
    <xdb:documentId>1005</xdb:documentId>
    <xdb:errorMsg>SQL16267N An XML value "abc" at or near line "1" in
      document "1005" is not valid according to its declared XML
      schema type "xs:integer" or is outside the supported range of
      values for the XML schema type</xdb:errorMsg>
  </xdb:document>
</xdb:errorReport>

Figure 11.18 Sample error report from bulk decomp

11.4 SUMMARY

When you consider shredding XML documents into relational tables, remember that XML and relational data are based on fundamentally different data models. Relational tables are flat and unordered collections of rows with strictly typed columns, and each row in a table must have the same structure.
One-to-many relationships are expressed by using multiple tables and join relationships between them. In contrast, XML documents tend to have a hierarchical and nested structure that can represent multiple one-to-many relationships in a single document. XML allows elements to be repeated any number of times, and XML Schemas can define hundreds or thousands of optional elements and attributes that may or may not exist in any given document. Due to these differences, shredding XML data to relational tables can be difficult, inefficient, and sometimes prohibitively complex.
If the structure of your XML data is of limited complexity such that it can easily be mapped to relational tables, and if your XML format is unlikely to change over time, then XML shredding can sometimes be useful to feed existing relational applications and reporting software.

DB2 offers two methods for shredding XML data. The first method uses SQL INSERT statements with the XMLTABLE function. One such INSERT statement is required for each target table and multiple statements can be combined in a stored procedure to avoid repetitive parsing of the same XML document. The shredding statements can include XQuery and SQL functions, joins to other tables, or references to DB2 sequences. These features allow for customization and a high degree of flexibility in the shredding process, but require manual coding.

The second approach for shredding XML data uses annotations in an XML Schema to define the mapping from XML to relational tables and columns. IBM Data Studio Developer provides a visual interface to create this mapping conveniently with little or no manual coding.
CHAPTER 12 Updating and Transforming XML Documents

This chapter describes techniques to update and transform XML documents. DB2 pureXML supports three general techniques for modifying XML documents:

• Full document replacement, which allows an application to replace an existing XML document with an updated document. It is up to the application to provide the new document. For DB2, this is a full-document operation. See section 12.1.
• The XQuery Update Facility, which is a standardized extension to XQuery that allows you to modify, insert, or delete individual elements and attributes within an XML document. Such updates are also known as subdocument level updates, node-level updates, or partial document updates. See sections 12.2 through 12.13.
• Extensible Stylesheet Language Transformation (XSLT), which lets you apply a style sheet to an XML document to transform it into a different XML format, HTML format, or some other user-defined format. See section 12.14.

The discussion in this chapter assumes that you are familiar with querying XML data as described in Chapters 6 through 9.

When you update an XML document you can optionally validate it against an XML Schema, either explicitly with the XMLVALIDATE function or automatically with a trigger. If the updated document does not comply with the specified XML Schema, the update fails. The validation of XML documents upon insert and update is explained in Chapter 17, Validating XML Documents against XML Schemas.
12.1 REPLACING A FULL XML DOCUMENT

You can use regular relational SQL UPDATE statements to replace a full XML document in a table with a new document. This treats the XML document as a “black box” and does not modify individual elements or attributes. Your application needs to provide the new document, possibly after reading an existing document and modifying it in the application.

An SQL UPDATE statement often has a WHERE clause to qualify one or more specific rows that you want to update. When you replace one document with another, the UPDATE statement typically needs to select a single row. Otherwise multiple existing documents might get replaced by the same new document, which is typically not what you want. You can use relational predicates, XML predicates, or any combination of those to select the appropriate row in which you want to update a document. Chapters 6 and 7 on querying XML data provide many examples of such predicates.

In this section we look at UPDATE statements that perform full-document replacement with various different predicates to select the appropriate document:

• Predicate on the relational columns of the table
• Predicate on an XML element value
• Predicate on an XML attribute value
• Predicates on XML and relational values

We also show how to provide new documents via parameter markers and how to remove existing documents by replacing them with a relational NULL value.

The UPDATE statement in Figure 12.1 replaces the existing XML document in the info column with a new XML document, but only in the row where the relational column cid has the value 1000. The SET clause of the UPDATE statement performs the assignment of the new document to the XML column info in the selected row.
UPDATE customer
SET info = '<?xml version="1.0" encoding="UTF-8" ?>
  <customerinfo Cid="1010">
    <name>Larry Trotter</name>
    <addr country="England">
      <street>5 Rosewood</street>
      <city>Winchester</city>
      <prov-state>Hampshire</prov-state>
      <pcode-zip>HU16 6666</pcode-zip>
    </addr>
    <phone type="work">416-555-1358</phone>
  </customerinfo>'
WHERE cid = 1000;

Figure 12.1 Replacing a full XML document based on a relational value
In the example in Figure 12.1, the new XML document is provided as a literal value. More commonly you will use UPDATE statements in your application with parameter markers or host variables that carry the new XML document, as shown in Figure 12.2.

UPDATE customer SET info = ? WHERE cid = 1000

UPDATE customer SET info = :hvar WHERE cid = 1000

Figure 12.2 Full document replacement with parameter marker or host variable

You can also use parameter markers or host variables to provide the relational value by which you select the row where you want to replace the XML document (see Figure 12.3).

UPDATE customer SET info = ? WHERE cid = ?

UPDATE customer SET info = :hvar1 WHERE cid = :hvar2

Figure 12.3 Full document replacement with parameter markers or host variables

Figure 12.4 shows UPDATE statements that use various different WHERE clauses to select the document that gets replaced in the SET clause of the statement. The new document is provided through a parameter marker, but it could be a literal document as shown previously in Figure 12.1.

The first UPDATE in Figure 12.4 replaces the document where the name element has the value Larry Trotter. Ideally there should be only one such document in the table. Otherwise all documents where the name is Larry Trotter are replaced. Also, remember that the square brackets in the XMLEXISTS predicate are important. If you omit them, all rows qualify and are updated. The second UPDATE statement in Figure 12.4 uses a conjunction of an XML predicate and a relational predicate to qualify the document to be replaced. The third UPDATE statement uses an XML predicate on the attribute Cid in the XML data. The fourth UPDATE statement replaces the document of the customer whose work phone number is 416-555-1358. Note that the four UPDATE statements only differ in their WHERE clauses, which you can code in many different ways to select the desired document for replacement.

UPDATE customer SET info = ?
WHERE XMLEXISTS('$INFO/customerinfo[name = "Larry Trotter"]');

UPDATE customer SET info = ?
WHERE XMLEXISTS('$INFO/customerinfo[name = "Larry Trotter"]')
  AND cid = 1000;

UPDATE customer SET info = ?
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1005]');

UPDATE customer SET info = ?
WHERE XMLEXISTS('$INFO/customerinfo/phone[@type="work" and
                 text()="416-555-1358"]');

Figure 12.4 Full XML document replacement with various different WHERE clauses

If you run the UPDATE statements shown in Figure 12.4 in DB2 for z/OS, remember that the XMLEXISTS predicate always requires a PASSING clause, like this:

…WHERE XMLEXISTS('$i/customerinfo[@Cid = 1005]' PASSING info AS "i")

You can also replace an existing XML document with a NULL value, which removes the document from the row without deleting the row:

UPDATE customer SET info = NULL WHERE cid = 1000

The disadvantage of full document replacement is that it is left to the application to provide a new and possibly updated document. Often this means that an application has to read a document from the database, parse and modify the document using application code, and then execute one of the UPDATE statements discussed in this section to replace the original document with the updated one. This process requires dedicated application logic as well as moving XML documents back and forth between the application and the DB2 server. This can be improved with XQuery Updates, which are discussed next.

12.2 MODIFYING DOCUMENTS WITH XQUERY UPDATES

In many situations you want to make modifications to your XML documents and not just replace one document with another. Since version 9.5, DB2 for Linux, UNIX, and Windows supports the XQuery Update Facility, a standardized extension to XQuery that allows you to modify, insert, or delete individual elements and attributes within an XML document. These capabilities make updating XML data easier and provide better performance than performing full document replacements.
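For comparison, the application-side work that full document replacement requires can be sketched in Java. The sketch below covers only the middle step of the read, modify, and replace cycle described in section 12.1: it parses a customerinfo document that the application has already fetched, changes the street value, and serializes the result so that it could be bound to the parameter marker of an UPDATE such as the ones in Figure 12.3. The class name StreetChanger and the method withStreet are hypothetical, and the code uses only the standard JDK XML APIs, not any DB2-specific interface.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: the "modify in the application" step of full document replacement. */
final class StreetChanger {

    /** Returns the document text with the value of every street element set to newStreet. */
    static String withStreet(String customerXml, String newStreet) {
        try {
            // 1. Parse the document that was read from the XML column.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(customerXml)));

            // 2. Modify the element value in the in-memory tree.
            NodeList streets = doc.getElementsByTagName("street");
            for (int i = 0; i < streets.getLength(); i++) {
                streets.item(i).setTextContent(newStreet);
            }

            // 3. Serialize the whole document again; this text is what the
            //    application would send back as the replacement document.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot modify customer document", ex);
        }
    }
}
```

Even for a one-value change, the application parses and re-serializes the complete document and ships it in both directions, which is the overhead that node-level XQuery Updates avoid.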
XQuery Updates allow you to modify individual XML nodes, such as elements and attributes, in the following ways:

• Replace the value of a node
• Replace a node with a new one
• Insert a new node (at a specific location, such as before or after a given node)
• Delete a node
• Rename a node
• Modify multiple nodes in a document in a single statement
• Update multiple documents in a single statement

To perform updates of an XML document, use the XQuery transform expression. This expression can start with the optional transform keyword and consists of three clauses: the copy clause, the modify clause, and the return clause (Figure 12.5). The intuitive idea of the transform expression is that the copy clause assigns an input document from an XML column to a variable, then the modify clause applies one or more modifications to that variable, and finally the return clause produces the result of the transform expression. XQuery is a case-sensitive language and all keywords have to be in lowercase, including copy, modify, and return.

   .-transform-.
>>-+------------+-------------------------------------------------->
>--copy----$VariableName--:=--CopySourceExpression-+--------------->
>--modify--ModifyExpression---------------------------------------->
>--return--ReturnExpression----------------------------------------|

Figure 12.5 High-level syntax of the transform expression

Such XML modifications can be performed in an SQL UPDATE statement, in a query, or as part of an INSERT statement (Figure 12.6). If you modify a document in a query, the query reads the document from an XML column, changes it on-the-fly, and returns the modified document to the application. This leaves the original version of the document in the DB2 table unchanged. If you modify a document in an UPDATE statement, you make a permanent change to the data that is stored in DB2. Such an UPDATE is logged in the DB2 transaction log and subject to all the transaction management concepts that also apply to relational updates, such as commit, rollback, and recovery, when applicable. Concurrency control (locking) and logging happen at the full document level.
You can also modify a new document at insert time if you include an XQuery transform expression in an SQL INSERT statement.
[Figure 12.6 has three panels:
 Updating a stored document: Make a permanent change to a document in the database. This UPDATE is logged.
 Updating a returned document upon retrieval: Modify a document as part of a query. The original document in the database is not changed.
 Updating a new document upon insert: Modify a new document during INSERT. The modified document is inserted and logged.]

Figure 12.6 Three ways of modifying XML documents

The concepts of changing XML element or attribute values, inserting new elements, renaming elements, and so on are independent from whether you do this in an UPDATE statement, in a query, or in an INSERT statement. The following sections describe the capabilities of the XQuery transform expressions and their usage in SQL UPDATE statements. Sections 12.10 and 12.11 then show how the same document modifications can be performed in queries and INSERT statements.

12.3 UPDATING THE VALUE OF AN XML NODE IN A DOCUMENT

A simple and common kind of XML update is to change the value of a specific element or attribute node in an XML document.

12.3.1 Replacing an Element Value

As an example, assume you have to update the address of a customer to change the value of the street element to “43 WestCreek”. Figure 12.7 shows the original document and the desired updated document.
Original document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Updated document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>43 WestCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Figure 12.7 Changing the value of an element
The UPDATE statement that performs the desired modification of the document is shown in Figure 12.8. It assumes that the document to be updated resides in the info column of the customer table in a row with the relational cid value 1002. The SET clause of the UPDATE statement assigns a new value to the XML column info. This new value is produced by the XMLQUERY function, which contains an XQuery transform expression. The copy clause refers to the original XML column value ($INFO), and assigns the original document to the variable $mycust. Subsequently, the modify clause manipulates this variable. The modify clause contains the update operation replace value of to replace the value of the element street with the new string literal “43 WestCreek”. Finally, the variable $mycust, which contains the modified document, is returned in the return clause of the transform expression.

UPDATE customer
SET info = XMLQUERY('
      transform
      copy $mycust := $INFO
      modify do replace value of $mycust/customerinfo/addr/street
                with "43 WestCreek"
      return $mycust ')
WHERE cid = 1002

Figure 12.8 Update statement to replace the value of an element

In Figure 12.8 and many other typical update cases, the right side of the copy clause is just the variable that refers to the original document, in this case $INFO. The right side of the copy clause could be a more complex expression, but it must always evaluate to a single node. It cannot be an empty sequence or a sequence of more than one item. This single node can have descendants, which means it can be (and often is) the root of a full XML document.

In many update examples you will also see that the return clause simply returns the variable that holds the modified document. However, the return clause could contain a more complex expression, including element construction or a FLWOR expression.
Updates with more complex expressions in the copy and the return clauses are discussed in section 12.10. Since the transform keyword is optional, it is omitted from here on.

12.3.2 Replacing an Attribute Value

Replacing an attribute value is just as easy as replacing an element value. The UPDATE statement in Figure 12.9 changes the Cid attribute to the new value 1099. The entire UPDATE statement is the same as in Figure 12.8 except that the path to the target node and the new value are different. The literal value 1099 could be in double quotes but does not have to be because it can be interpreted as a number.

    UPDATE customer
    SET info = XMLQUERY('
      copy $mycust := $INFO
      modify do replace value of $mycust/customerinfo/@Cid with 1099
      return $mycust ')
    WHERE cid = 1002

Figure 12.9 Replacing the value of an attribute

12.3.3 Replacing a Value Using a Parameter Marker

Often you will want to prepare and compile an UPDATE statement only once, and then pass in a new value every time you execute it. This avoids recompiling the statement in the database server for each execution. The mechanism to use parameters is the same as for SQL/XML queries. The PASSING clause of the XMLQUERY function allows you to pass a SQL-style parameter marker (“?”) as a variable ($z) into the XQuery expression (Figure 12.10). Note that XQuery variables are case-sensitive. For example, $z and $Z are not the same. The statement in Figure 12.10 also uses a parameter marker in the WHERE clause to select the row to be updated.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/phone with $z
      return $newinfo'
      PASSING CAST(? AS VARCHAR(15)) AS "z")
    WHERE cid = ?

Figure 12.10 Updating XML values with parameter markers

You can run the UPDATE statement in Figure 12.10 from an application, such as a Java program. You would use JDBC statements to prepare and compile the statement, bind a value from an application variable to the parameter marker, and then execute the statement.

12.3.4 Replacing Multiple Values in a Document

You can update multiple values in the same document in a single UPDATE statement. Figure 12.11 illustrates that the modify clause allows for a comma-separated list of update operations. The entire list is enclosed in parentheses. This enables you to easily combine two or more update operations in a single statement.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify (do replace value of $newinfo/customerinfo/addr/street
                 with "85 Leicester Rd",
              do replace value of $newinfo/customerinfo/addr/pcode-zip
                 with "W7B 8X1" )
      return $newinfo ')
    WHERE cid = 1002

Original document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>85 Leicester Rd</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>W7B 8X1</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Figure 12.11 Updating multiple values in a single UPDATE statement

If you want to update multiple values in a single UPDATE statement and use parameter markers for all values, the PASSING clause of the XMLQUERY function needs to contain a list of typed parameter markers together with the variable names that refer to them (see Figure 12.12).

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify (do replace value of $newinfo/customerinfo/addr/street
                 with $str,
              do replace value of $newinfo/customerinfo/addr/pcode-zip
                 with $zip )
      return $newinfo'
      PASSING CAST(? AS VARCHAR(30)) AS "str",
              CAST(? AS VARCHAR(10)) AS "zip")
    WHERE cid = 1002

Figure 12.12 Updating multiple values with parameter markers

12.3.5 Replacing an Existing Value with a Computed Value

The value that you use to update an existing element or attribute does not necessarily have to be a fixed value but can be computed based on the existing values in the document. For example,
assume that the customer documents can contain an element numorders that tracks the total number of orders that a customer has placed. The UPDATE statement in Figure 12.13 increments the value of the element numorders by 1.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/numorders
                with $newinfo/customerinfo/numorders + 1
      return $newinfo ')
    WHERE cid = 1002

Original document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <numorders>16</numorders>
    </customerinfo>

Updated document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <numorders>17</numorders>
    </customerinfo>

Figure 12.13 Incrementing the numeric value of an element

Similarly, the UPDATE statement in Figure 12.14 modifies the value of the element street by appending an apartment number. It uses the XQuery function concat.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/addr/street
                with concat($newinfo/customerinfo/addr/street, " Apt #4")
      return $newinfo ')
    WHERE cid = 1002

Figure 12.14 Appending an apartment number to the street

If you write more elaborate updates, you might find it tedious to repeat a long path such as $newinfo/customerinfo/addr/street whenever you reference an existing node in the document. Figure 12.15 uses a let clause to assign this long path to the variable $s. Subsequently, the do replace value clause uses $s multiple times instead of repeating the long path.
Note that the modify clause contains a FLWOR expression that only consists of the let and the return clause while the for, where, and order by clauses are omitted. Hence, the XQuery expression
in Figure 12.15 also contains two return clauses. The first one belongs to the let and its FLWOR expression, and the second one is the return of the transform expression.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify let $s := $newinfo/customerinfo/addr/street
             return do replace value of $s with concat($s, " Apt #4")
      return $newinfo ')
    WHERE cid = 1002

Figure 12.15 Using let to assign a long path to a short variable

12.4 REPLACING XML NODES IN A DOCUMENT

Suppose a customer has moved to a different city and you need to update the address in the XML document that holds the customer’s information. You could write an UPDATE statement with replace value of expressions to individually change the values of all elements and attributes that make up the address of the customer (country, street, city, prov-state, and pcode-zip). However, such an update can be lengthy and tedious to write. It can be a lot easier to simply replace the existing addr element and all of its children with a new addr element. Such a replacement of a node is done with a replace expression. The replace expression works differently from the replace value of expression. The former replaces the whole node (the old node is deleted), whereas the latter replaces only the value of the target node.

Figure 12.16 shows an UPDATE statement that replaces the existing addr element and all of its child nodes with a new addr fragment. The structure of the new XML fragment does not have to be identical to the original one. Indeed, the new address in Figure 12.16 contains the elements state and zipcode, which are different from the original address. Similarly, you could decide to replace the original addr element and all of its children with a single email element, if you wanted to. If you choose to validate updated documents with an XML Schema, the new structure of the document has to conform to the XML Schema.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace $newinfo/customerinfo/addr
                with <addr country="United States">
                       <street>555 Bailey Avenue</street>
                       <city>San Jose</city>
                       <state>California</state>
                       <zipcode>95141</zipcode>
                     </addr>
      return $newinfo ')
    WHERE cid = 1002

Original document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="United States">
        <street>555 Bailey Avenue</street>
        <city>San Jose</city>
        <state>California</state>
        <zipcode>95141</zipcode>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Figure 12.16 Replacing an element node

Note that the new addr fragment in the modify clause of the UPDATE statement in Figure 12.16 is not enclosed in single quotes because it is not a string value. Instead, the new addr element and its children are constructed with direct element and attribute constructors (see section 8.4, Constructing XML Data).

The XML value that provides the new address can also be computed with an expression. For example, Figure 12.17 uses an XPath expression to obtain the addr element from the customer whose Cid attribute has the value 1004. This address element replaces the address of customer 1002.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace $newinfo/customerinfo/addr
                with db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid=1004]/addr
      return $newinfo ')
    WHERE cid = 1002

Figure 12.17 Updating multiple values in a single UPDATE statement
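
The difference between replace value of and replace can also be pictured outside of DB2. The following Python sketch uses the standard xml.etree.ElementTree module to mimic the two operations on the sample document. The helper names replace_value_of and replace_node are hypothetical and exist only for this illustration; the code models the behavior described above and says nothing about how DB2 implements it.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<customerinfo Cid="1002"><name>Jim Noodle</name>'
    '<addr country="Canada"><street>25 EastCreek</street>'
    '<city>Markham</city><prov-state>Ontario</prov-state>'
    '<pcode-zip>N9C 3T6</pcode-zip></addr>'
    '<phone type="work">905-555-7258</phone></customerinfo>'
)

NEW_ADDR = (
    '<addr country="United States"><street>555 Bailey Avenue</street>'
    '<city>San Jose</city><state>California</state>'
    '<zipcode>95141</zipcode></addr>'
)


def replace_value_of(root, path, new_value):
    """Hypothetical model of 'replace value of': the node stays, only its text changes."""
    target = root.find(path)
    target.text = new_value
    return root


def replace_node(root, path, new_xml):
    """Hypothetical model of 'replace': the old node is removed, a new node takes its place."""
    old = root.find(path)
    # ElementTree keeps no parent pointers, so search for the parent of the old node.
    parent = next(p for p in root.iter() if any(c is old for c in p))
    position = list(parent).index(old)
    parent.remove(old)
    parent.insert(position, ET.fromstring(new_xml))
    return root


value_only = replace_value_of(ET.fromstring(DOC), "addr/street", "43 WestCreek")
whole_node = replace_node(ET.fromstring(DOC), "addr", NEW_ADDR)

# replace value of: addr keeps its original children, only the street text differs
print([child.tag for child in value_only.find("addr")])
# replace: addr now has the children of the new fragment
print([child.tag for child in whole_node.find("addr")])
```

In the model, the first call leaves the structure of addr untouched, while the second call swaps in a fragment with different child elements at the same position in the document.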
12.5 DELETING XML NODES FROM A DOCUMENT

This section describes how to delete elements or attributes from a document. As an example, suppose that a phone number of a customer is invalid and you want to remove the entire phone element from the corresponding XML document. Figure 12.18 shows a first attempt at writing an appropriate UPDATE statement. It looks much like the previous UPDATE statements except that the updating expression is delete instead of replace value of. In the delete expression, simply specify the path to the elements or attributes that you want to remove from the document.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do delete $newinfo/customerinfo/phone
      return $newinfo')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
    </customerinfo>

Figure 12.18 Deleting an element

The document that is being updated in Figure 12.18 contains multiple phone elements, and the delete expression removes all of them. If you don’t want to delete all occurrences of a repeating element, add a predicate to the target path to delete only selected occurrences.
For example, the following delete expression removes a phone element only if its type attribute has the value home:

    do delete $newinfo/customerinfo/phone[@type="home"]

This delete expression removes exactly one phone element from the original document in Figure 12.18, and leaves the other two phone elements untouched. In general, this expression can delete zero, one, or multiple phone elements from a document, depending on how many phone elements with a type attribute equal to home occur in a given document. Modifying repeating elements is further discussed in section 12.8.
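
To make the "zero, one, or multiple" behavior concrete, here is a small Python sketch that models a delete with and without a predicate. It uses only the standard xml.etree.ElementTree module; delete_nodes and phone_types are hypothetical helper names, and the address element is left out of the sample document for brevity. It is an illustration of the semantics, not of DB2 internals.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>'
)


def delete_nodes(root, path):
    """Hypothetical model of 'do delete': remove every child of root matched by path."""
    matches = root.findall(path)      # may be empty; that is not an error
    for node in matches:
        root.remove(node)
    return len(matches)


def phone_types(root):
    """List the type attributes of the phone elements that are still present."""
    return [p.get("type") for p in root.findall("phone")]


doc = ET.fromstring(DOC)
removed_home = delete_nodes(doc, "phone[@type='home']")   # exactly one match
removed_fax = delete_nodes(doc, "phone[@type='fax']")     # no match, still succeeds
types_after_predicates = phone_types(doc)
removed_rest = delete_nodes(doc, "phone")                 # all remaining phones
```

Note that the predicate tests the attribute (@type), mirroring the delete expression above. A path without a predicate removes every remaining phone element.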
NOTE Predicates in the update expression only serve to select nodes within any given document. They do not help you to efficiently find the documents that should be updated. Predicates that select documents for update must be placed in the WHERE clause of the SQL UPDATE statement. They can include XMLEXISTS predicates.

If you want to delete an attribute, such as country, simply use a delete expression with an XPath that points to the attribute:

    do delete $newinfo/customerinfo/addr/@country

You can also remove an entire XML fragment from an XML document. For example, the statement in Figure 12.19 deletes the entire addr element including all the child elements and attributes it contains.

    UPDATE customer
    SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do delete $newinfo/customerinfo/addr
      return $newinfo')
    WHERE cid = 1002

Original document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Figure 12.19 Deleting an XML fragment

12.6 RENAMING ELEMENTS OR ATTRIBUTES IN A DOCUMENT

The rename expression enables you to change the name of an element or attribute. For example, the statement in Figure 12.20 renames the addr element to address. The new element name address is a string literal and must be enclosed in double quotes.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do rename $new/customerinfo/addr as "address"
      return $new ')
    WHERE cid = 1002

Original document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <addr country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1002">
      <name>Jim Noodle</name>
      <address country="Canada">
        <street>25 EastCreek</street>
        <city>Markham</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N9C 3T6</pcode-zip>
      </address>
      <phone type="work">905-555-7258</phone>
    </customerinfo>

Figure 12.20 Changing an element name

DB2 never allows you to update a document in a manner that violates the rules for well-formed XML documents. For example, in an element such as <product xid="15" yid="107"> you cannot rename the attribute xid to yid. This update operation is rejected because it would produce an element with two attributes that have the same name (yid), which is not permitted in any XML document.

12.7 INSERTING XML NODES INTO A DOCUMENT

This section describes how to add element or attribute nodes to a document. When you insert a new element or attribute into a document, you must specify the target position of the new node in the document. We first discuss the positioning of inserted elements, then the positioning of inserted attributes, and then look at several examples.

12.7.1 Defining the Position of Inserted Elements

Suppose you want to insert the new element <email>jnoodle@ibm.com</email> into the XML document for customer Jim Noodle. You have to decide which existing element is going to be the parent for the new email element. For example, you might decide that email is going to be a child element of the root element customerinfo. This makes email a sibling of the elements name, addr, and phone.
Then you can further choose the position of the email element among its siblings. For example, should email appear before or after the addr element? Alternatively, you could decide that email is going to be a child element of addr and therefore becomes a sibling of street, city, prov-state, and pcode-zip. The insert operation in the modify clause allows you to add new nodes to an XML document. It offers five ways to specify the position of the new node: into, as last into, as first
into, after, and before. Examples of using these five options for a new element are listed in Table 12.1.

Table 12.1 Five Options for Inserting an Element into a Document

    Insert operation:  insert <email>jnoodle@ibm.com</email> into $new/customerinfo
    Position:          email becomes a child element of customerinfo. The position of
                       email among the existing children of customerinfo is
                       nondeterministic.

    Insert operation:  insert <email>jnoodle@ibm.com</email> as last into $new/customerinfo
    Position:          email becomes the last child element of customerinfo.

    Insert operation:  insert <email>jnoodle@ibm.com</email> as first into $new/customerinfo
    Position:          email becomes the first child element of customerinfo.

    Insert operation:  insert <email>jnoodle@ibm.com</email> after $new/customerinfo/addr
    Position:          email becomes a sibling of addr and therefore a child of
                       customerinfo. email appears immediately after addr.

    Insert operation:  insert <email>jnoodle@ibm.com</email> before $new/customerinfo/addr
    Position:          email becomes a sibling of addr and a child of customerinfo.
                       email appears immediately before addr.

The path that defines the target location of the insert, such as $new/customerinfo or $new/customerinfo/addr, has to produce exactly one node. If the path does not exist in the document or if it exists more than once, the operation fails with error SQL16085N. If you look up the explanation for SQL16085N you find that a common reason is described as “the target node of an insert expression is not a single element node or document node.” Beware that the words “not a single element node” do not necessarily imply that more than one target node was found. It’s equally possible that no target node was found. “Not a single element” means that either zero or more than one node was found, so you should check for both cases when you encounter error SQL16085N. For example, if you misspell a tag name in the target path, error SQL16085N is raised because no target node was found.
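
The five positions in Table 12.1 and the "exactly one target node" rule can be modeled in a few lines of Python. This is only a sketch under stated assumptions: insert_element, attempt, and the NotASingleNode exception are hypothetical names, the exception merely stands in for the condition that DB2 reports as SQL16085N, and the plain into option (whose position is nondeterministic in DB2) is arbitrarily modeled as an append. The sample document is shortened for brevity.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><street>845 Kean Street</street></addr>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone></customerinfo>'
)
EMAIL = "<email>robert@ibm.com</email>"


class NotASingleNode(Exception):
    """Hypothetical stand-in for the condition that DB2 reports as SQL16085N."""


def insert_element(root, new_xml, position, path):
    """Hypothetical model of 'do insert <element> <position> <path>'."""
    targets = root.findall(path)
    if len(targets) != 1:              # zero matches fail just like several matches
        raise NotASingleNode(f"{path!r} matched {len(targets)} nodes")
    target, new = targets[0], ET.fromstring(new_xml)
    if position in ("into", "as last into"):
        # 'into' leaves the position open; this model arbitrarily appends.
        target.append(new)
    elif position == "as first into":
        target.insert(0, new)
    elif position in ("before", "after"):
        parent = next((p for p in root.iter() if any(c is target for c in p)), None)
        if parent is None:
            raise ValueError("the target has no parent element in this model")
        index = list(parent).index(target)
        parent.insert(index if position == "before" else index + 1, new)
    else:
        raise ValueError(f"unknown position: {position}")
    return [child.tag for child in root]


def attempt(position, path):
    """Run one insert against a fresh copy of DOC and report the outcome."""
    try:
        return insert_element(ET.fromstring(DOC), EMAIL, position, path)
    except NotASingleNode as err:
        return str(err)


print(attempt("after", "addr"))    # email lands between addr and the first phone
print(attempt("after", "adr"))     # misspelled tag: no target node found
print(attempt("before", "phone"))  # repeating element: more than one target node
```

The last two calls fail for opposite reasons (no match versus two matches), yet both end up in the same error path, which is the point made about SQL16085N above.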
12.7.2 Defining the Position of Inserted Attributes

To insert a new attribute instead of an element, you have to use a computed attribute constructor. It consists of the keyword attribute followed by the attribute name and an expression or constant that provides the attribute value. The same five insert options are available as for elements and are shown in Table 12.2. The difference for attributes is that the operations into $new/customerinfo, as last into $new/customerinfo, and as first into $new/customerinfo all have the same effect. Their effect is that the new attribute becomes an attribute of the element customerinfo. Since the XML data model does not define a positional order
among the attributes of an element, attributes are always unordered. Therefore the keywords last, first, before, and after do not affect the position of attributes. If you insert an attribute before or after $new/customerinfo/addr, the attribute becomes a sibling of addr and is therefore added to the parent of addr, which is customerinfo.

Table 12.2 Five Options for Inserting an Attribute into a Document

    Insert operation:  insert attribute email {"jnoodle@ibm.com"} into $new/customerinfo
    Insert operation:  insert attribute email {"jnoodle@ibm.com"} as last into $new/customerinfo
    Insert operation:  insert attribute email {"jnoodle@ibm.com"} as first into $new/customerinfo
    Position:          In all three cases, email becomes an attribute of customerinfo.
                       The position of email among the existing attributes is
                       undefined because attributes are not ordered.

    Insert operation:  insert attribute email {"jnoodle@ibm.com"} after $new/customerinfo/addr
    Insert operation:  insert attribute email {"jnoodle@ibm.com"} before $new/customerinfo/addr
    Position:          In both cases, email becomes an attribute of the parent of addr,
                       which is customerinfo.

12.7.3 Insert Examples

For the following examples, assume that an email element has to be inserted into the XML document for Robert Shoemaker. This document is identified by the relational cid value 1003. Figure 12.21 shows a first attempt at performing this update. The UPDATE statement fails with error message SQL20345N because the target path is specified as $new instead of $new/customerinfo. When the target path is $new, the email element is inserted as a sibling and not as a child of the customerinfo element. The result is a sequence of two elements (customerinfo, email), which is not a well-formed XML document. Since XML columns can only contain well-formed documents, the update fails. It fails for the same reason if you specify before $new/customerinfo or after $new/customerinfo as the target position.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>robert@ibm.com</email> as last into $new
      return $new')
    WHERE cid = 1003

    SQL20345N The XML value is not a well-formed document with a single
    root element. SQLSTATE=2200L

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Rejected XML value:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>
    <email>robert@ibm.com</email>

Figure 12.21 Cannot insert an element as a sibling of the root element

Figure 12.22 shows the corrected UPDATE statement and the correctly modified XML document. You could similarly insert the email element as first into $new/customerinfo.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>robert@ibm.com</email>
                as last into $new/customerinfo
      return $new')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
      <email>robert@ibm.com</email>
    </customerinfo>

Figure 12.22 Inserting a new element as the last element
If you want the email element to appear in the document before the phone elements, you can explicitly request it to be inserted before the first occurrence of any existing phone elements using the positional predicate [1]. This is shown in Figure 12.23 where the positional predicate selects exactly one phone element as the target location. If you omit the positional predicate, the UPDATE statement fails with error SQL16085N. The statement in Figure 12.23 would also fail if the document contained no phone elements.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>robert@ibm.com</email>
                before $new/customerinfo/phone[1]
      return $new')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <email>robert@ibm.com</email>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Figure 12.23 Inserting a new element before an existing element

If you want to insert the email element after the last phone element but before any other elements that might appear at the end of the document, specify the insert position to be after $new/customerinfo/phone[last()]. As another example, Figure 12.24 shows an UPDATE statement that inserts the new email element as the first child of the addr element.
Alternatively, the UPDATE statement in Figure 12.25 inserts the email address as an attribute of the addr element. In the updated document, the attribute email happens to appear before the attribute country. But this order is not relevant and not guaranteed because XML attributes have no defined order. If you change the target position of the inserted attribute to after $new/customerinfo/addr/city or before $new/customerinfo/addr/@country, the updated document is still the same as shown in Figure 12.25.
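
The claim that the position keywords make no difference for attributes can be checked with a small Python model. The sketch below uses the standard xml.etree.ElementTree module; insert_attribute is a hypothetical helper, and only element targets are modeled (an attribute target such as @country is not expressible in ElementTree paths). It illustrates the rule, not the DB2 implementation.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><street>845 Kean Street</street>'
    '<city>Aurora</city></addr></customerinfo>'
)


def insert_attribute(root, name, value, position, path):
    """Hypothetical model of 'do insert attribute ... <position> <path>'."""
    (target,) = root.findall(path)     # the target path must match exactly one node
    if position in ("before", "after"):
        # A sibling of the target belongs to the parent element of the target.
        target = next(p for p in root.iter() if any(c is target for c in p))
    target.set(name, value)
    return root.find("addr").attrib    # attributes of addr as an unordered mapping


into_addr = insert_attribute(
    ET.fromstring(DOC), "email", "robert@ibm.com", "into", "addr")
first_into = insert_attribute(
    ET.fromstring(DOC), "email", "robert@ibm.com", "as first into", "addr")
after_city = insert_attribute(
    ET.fromstring(DOC), "email", "robert@ibm.com", "after", "addr/city")

# All three variants leave addr with the same set of attributes.
print(into_addr == first_into == after_city)
```

Because the attributes are compared as a mapping, their serialization order plays no role, which matches the XML data model.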

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>robert@ibm.com</email>
                as first into $new/customerinfo/addr
      return $new')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <email>robert@ibm.com</email>
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Figure 12.24 Inserting a new element as the first child element of a target node

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert attribute email {"robert@ibm.com"}
                into $new/customerinfo/addr
      return $new')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr email="robert@ibm.com" country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Figure 12.25 Inserting an attribute

12.8 HANDLING REPEATING AND MISSING NODES

If a single XPath expression identifies multiple nodes in a single document, they are called repeating nodes. In previous sections you saw that the XML document for Robert Shoemaker contains multiple phone elements. Hence, the element phone is a repeating element and the path /customerinfo/phone produces a sequence of more than one element node.
As defined by the XQuery Update standard, the delete expression is the only update operation that can directly process multiple occurrences of a node. It simply deletes all of them, as you saw in section 12.5. All other update expressions (replace, replace value of, rename, and insert) require special attention when dealing with repeating nodes. The same applies to missing nodes. If you try to delete an element or attribute that does not exist, the delete expression performs no action and returns successfully. However, all other update expressions fail when they try to modify an element or attribute that does not exist in the target document.

The UPDATE statement in Figure 12.26 tries to change the value of a phone element but fails. At runtime, DB2 detects that there is more than one phone element in the target document and returns error SQL16085N. You can type “? SQL16085N” at the DB2 command prompt to find that the explanation for reason code XUTY0008 is that “the target node of a replace expression is not a single node”. This reason code indicates that the target path $new/customerinfo/phone has either produced multiple phone elements or none. However, it must produce exactly one node for the update to be successful. The error prevents you from updating multiple phone elements with the same number, which would not make sense. If no phone element exists, the error ensures that you are not led to believe that the new phone number was successfully written to the document.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do replace value of $new/customerinfo/phone
                with "123-456-7890"
      return $new ')
    WHERE cid = 1003

    SQL16085N The target node of an XQuery "replace value of" expression
    is not valid. Error QName=err:XUTY0008. SQLSTATE=10703.

Figure 12.26 Trying to replace the value of a repeating element

If you know that there are multiple phone elements, a common way to avoid error SQL16085N is to add a predicate to the target path to select exactly one phone element for update. As an example, Figure 12.27 uses the predicate [@type="cell"] to only update the cell phone number.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify do replace value of $new/customerinfo/phone[@type="cell"]
                with "123-456-7890"
      return $new ')
    WHERE cid = 1003

Figure 12.27 Replacing one of multiple occurrences of an element
Using the predicate in Figure 12.27 works well if every possible target document contains exactly one phone element with a type attribute equal to cell. However, if a document does not contain a cell phone element, the UPDATE statement in Figure 12.27 still fails with error SQL16085N. In that case, another option is to use the XQuery if-then-else expression, as shown in Figure 12.28. If a cell phone element exists then its value is replaced with a new value, else a new cell phone element with the new number is inserted. This implements an “upsert” operation.

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify if ($new/customerinfo/phone[@type="cell"])
             then do replace value of $new/customerinfo/phone[@type="cell"]
                     with "123-456-7890"
             else do insert <phone type="cell">123-456-7890</phone>
                     as last into $new/customerinfo
      return $new ')
    WHERE cid = 1001

Figure 12.28 Conditional update and insert of an element

The most resilient solution for handling both repeating and missing elements is a FLWOR expression in the modify clause (see Figure 12.29). The for clause iterates over the target elements one at a time, so that the replace value of expression in the return clause is always applied to exactly one element. If you remove the condition where $j/@type = "cell", all phone elements are updated with the same number "123-456-7890", regardless of their type. If a document does not contain a cell phone or no phone elements at all, the return clause of the FLWOR expression is never invoked so that the replace value of expression never fails due to a missing node.

In summary, the FLWOR expression in the modify clause enables an UPDATE statement to

• Modify multiple or all occurrences of a repeating node (without warning)
• Add predicates to select which occurrences of a repeating node to modify
• Silently proceed and return successfully even if a target node is not found
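
The three strategies discussed in this section (a plain replace value of, the if-then-else "upsert", and the FLWOR iteration) differ only in how they react to zero or several matching nodes. The following Python sketch contrasts them using the standard xml.etree.ElementTree module. All function names and the NotASingleNode exception are hypothetical, the exception merely stands in for the condition that DB2 reports as SQL16085N, and the NO_PHONE document is made up for this illustration.

```python
import xml.etree.ElementTree as ET

THREE_PHONES = (
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>'
)
# Made-up document without any phone element, for illustration only.
NO_PHONE = '<customerinfo Cid="9999"><name>No Phone</name></customerinfo>'
CELL = "phone[@type='cell']"
NEW_NUMBER = "123-456-7890"


class NotASingleNode(Exception):
    """Hypothetical stand-in for the condition that DB2 reports as SQL16085N."""


def replace_value_strict(root, path, value):
    """Plain 'replace value of': the path must match exactly one node."""
    targets = root.findall(path)
    if len(targets) != 1:
        raise NotASingleNode(f"{path!r} matched {len(targets)} nodes")
    targets[0].text = value


def upsert_cell(root, value):
    """if-then-else: replace the cell phone if present, otherwise insert one."""
    if root.findall(CELL):
        replace_value_strict(root, CELL, value)
    else:
        ET.SubElement(root, "phone", {"type": "cell"}).text = value


def replace_each(root, path, value):
    """FLWOR style: visit the matches one at a time; zero matches is not an error."""
    count = 0
    for node in root.findall(path):
        node.text = value
        count += 1
    return count


def numbers(root):
    """Map each phone type to its current number."""
    return {p.get("type"): p.text for p in root.findall("phone")}


doc = ET.fromstring(THREE_PHONES)
try:
    replace_value_strict(doc, "phone", NEW_NUMBER)
    strict_outcome = "updated"
except NotASingleNode as err:
    strict_outcome = str(err)         # three phone elements: rejected

empty = ET.fromstring(NO_PHONE)
upsert_cell(empty, NEW_NUMBER)        # no cell phone yet: one is inserted
upsert_cell(doc, NEW_NUMBER)          # cell phone exists: its value is replaced
flwor_count = replace_each(ET.fromstring(NO_PHONE), CELL, NEW_NUMBER)
```

The design trade-off mirrors the text: the strict variant surfaces unexpected documents as errors, whereas the FLWOR-style loop never fails but also never tells you that nothing was updated.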

    UPDATE customer
    SET info = XMLQUERY('
      copy $new := $INFO
      modify for $j in $new/customerinfo/phone
             where $j/@type = "cell"
             return do replace value of $j with "123-456-7890"
      return $new')
    WHERE cid = 1003

Original document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">905-555-8743</phone>
    </customerinfo>

Updated document:

    <customerinfo Cid="1003">
      <name>Robert Shoemaker</name>
      <addr country="Canada">
        <street>845 Kean Street</street>
        <city>Aurora</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>N8X 7F8</pcode-zip>
      </addr>
      <phone type="work">905-555-7258</phone>
      <phone type="home">416-555-2937</phone>
      <phone type="cell">123-456-7890</phone>
    </customerinfo>

Figure 12.29 Iterating over the occurrences of a repeating element

12.9 MODIFYING MULTIPLE XML NODES IN THE SAME DOCUMENT

You can have multiple update operations for the same document in the modify clause of a single UPDATE statement. However, you cannot rename, replace, or update the value of the same node more than once. In this section we discuss examples where multiple combined update operations are or are not in conflict with each other.

12.9.1 Snapshot Semantics and Conflict Situations

The XQuery Update standard defines that all update operations in the modify clause are applied independently from each other to the original document. They don’t see each other’s effects. This is called snapshot semantics, which means that each update operation is logically applied to a separate snapshot of the original document. As an example, let’s look at the UPDATE statement in Figure 12.30, which contains two updating expressions in the modify clause, separated by a comma.
The first expression inserts an additional phone element. The second expression deletes all phone elements. The obvious question is whether the newly inserted phone element is instantly removed by the delete expression, and whether that depends on the order in which the insert and the delete operations appear in the modify clause. As it turns out, the new phone element is not affected by the delete expression, irrespective of the order in which the operations appear in the modify clause. Due to snapshot
semantics, both the insert and the delete expressions in Figure 12.30 are independently applied to a snapshot of the original document. Therefore the delete expression does not see the newly inserted phone element and only removes the old phone elements that existed in the document prior to this update. Hence, there is no conflict between the insert and the delete expression in Figure 12.30.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do insert <phone type="cell">777-555-3333</phone>
       after $new/customerinfo/addr ,
    do delete $new/customerinfo/phone )
  return $new ')
WHERE cid = 1002

Original document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Updated document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="cell">777-555-3333</phone>
</customerinfo>

Figure 12.30 Combining an insert and a delete operation

For comparison, let’s look at a different combination of an insert and a delete expression in Figure 12.31. One of the expressions deletes the addr element, and the other expression inserts a new POBox element into the addr element. Again, the order of the two operations in the modify clause is irrelevant. Nevertheless, the two operations conflict with each other because the delete expression removes the parent element (addr) of the newly inserted POBox element. For this case, the language standard defines that delete “wins” over insert and the updated document has no addr or POBox elements. Be aware of these effects when you code complex updates.
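For contrast with the combinations discussed above, the following sketch shows a pair of operations that is not allowed at all: two replace value of expressions that target the same node. It is written as a query rather than an UPDATE (a technique covered in section 12.10) so that no stored data is touched. The replacement names are made up for illustration. The XQuery Update Facility classifies this situation as error err:XUDY0017, so the statement is expected to be rejected rather than to apply either change. The exact DB2 message is not shown here.

```sql
-- Sketch only: both expressions try to set the value of the same name node.
-- Snapshot semantics cannot decide which value should win, so this is
-- expected to fail instead of returning a modified document.
SELECT XMLQUERY('
  copy $new := $INFO
  modify (
    do replace value of $new/customerinfo/name with "Jim Noodle Sr.",
    do replace value of $new/customerinfo/name with "James Noodle" )
  return $new')
FROM customer
WHERE cid = 1002;
```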
UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do delete $new/customerinfo/addr ,
    do insert <POBox>15</POBox> into $new/customerinfo/addr )
  return $new ')
WHERE cid = 1002

Original document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Updated document:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Figure 12.31 A different combination of an insert and a delete operation

12.9.2 Converting Elements to Attributes and Vice Versa

The UPDATE statement in Figure 12.32 is another interesting example. It combines two insert expressions and two delete expressions in a single statement. The objective is to turn the existing Cid attribute into an element called customerid, and the existing element name into an attribute called custname. Four update operations are required to make this happen:
• Insert a customerid element and compute its value from the existing Cid attribute
• Insert a custname attribute and take its value from the existing name element
• Delete the existing Cid attribute
• Delete the existing name element

Again, the order of these four expressions in the modify clause does not matter. Snapshot semantics ensures that the four expressions are applied in isolation and produce the intended result. In particular, the insert expressions see their own logical snapshots of the original document, which enables them to read the Cid attribute and the name element even though these nodes are being deleted at the same time.
UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do insert <customerid> {$new/customerinfo/data(@Cid)} </customerid>
       as first into $new/customerinfo ,
    do insert attribute custname {$new/customerinfo/name}
       into $new/customerinfo,
    do delete $new/customerinfo/@Cid,
    do delete $new/customerinfo/name )
  return $new')
WHERE cid = 1002

Document before the update:

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Document after the update:

<customerinfo custname="Jim Noodle">
  <customerid>1002</customerid>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

Figure 12.32 Turning an attribute into an element and vice versa

12.10 MODIFYING XML DOCUMENTS IN QUERIES

Throughout this chapter you have seen many examples of XQuery Update expressions (transform expressions) that are enclosed in an XMLQUERY function in the SET clause of an SQL UPDATE statement. In this manner, they modify existing documents in the database with logging and locking as needed.

The exact same update expressions can also be used in an XMLQUERY function in the SELECT clause of a query. This allows you to modify XML documents as you read them from the database and return them to the application, without changing the original document in the table. Figure 12.33 shows how any UPDATE statement in this chapter can be “converted” into a SELECT statement that performs the same document modifications upon read rather than update of a document.

This can be useful for various purposes. When you develop XQuery Update expressions it is convenient to first test them as queries rather than updates.
The queries show the modified documents immediately so you can easily check whether your update expressions did exactly what you had intended. Secondly, the queries do not change your data on disk, which makes testing much safer in case a miscoded update expression deletes data unintentionally.
If the XML data in your database is used by multiple applications or services, it is likely that not all the consumers need or want to see the XML documents in the same shape and form. To retrieve only certain parts of an XML document, it can sometimes be easier to delete one part of the document while retrieving it than to extract every other part individually. Also, using insert into expressions in an XML query allows you to enrich an XML document on the fly with XML fragments from other documents.

-- The UPDATE statement:
UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace value of $newinfo/customerinfo/addr/pcode-zip
            with "XXX XXX"
  return $newinfo ')
WHERE cid = 1000 ;

-- Corresponding SELECT statement:
SELECT XMLQUERY('
  copy $newinfo := $INFO
  modify do replace value of $newinfo/customerinfo/addr/pcode-zip
            with "XXX XXX"
  return $newinfo ')
FROM customer
WHERE cid = 1000 ;

Figure 12.33 Corresponding UPDATE and SELECT statements

You can also write the SELECT statement in Figure 12.33 as an XQuery, which is shown in Figure 12.34. The copy clause assigns the input document that is to be updated to the variable $newinfo. The input document is produced by the function db2-fn:sqlquery, which contains an SQL query that retrieves exactly one document. The modify clause uses a replace expression to modify the document.

xquery
copy $newinfo := db2-fn:sqlquery("SELECT info FROM customer WHERE cid=1000")
modify do replace value of $newinfo/customerinfo/addr/pcode-zip
          with "XXX XXX"
return $newinfo;

Figure 12.34 Replacing values in a document upon read
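The idea of deleting one part of a document while retrieving it can be sketched in the same way. The query below is an illustration only; it assumes the customer documents used in this chapter and returns a customer document without its addr subtree, leaving the stored document unchanged. The variable name $doc is arbitrary.

```sql
-- Sketch only: return the customer document minus the addr element.
-- Nothing is written back to the table.
SELECT XMLQUERY('
  copy $doc := $INFO
  modify do delete $doc/customerinfo/addr
  return $doc')
FROM customer
WHERE cid = 1000;
```

For a consumer that needs everything except the address, this is shorter than constructing a new document from the Cid attribute, the name element, and each phone element individually.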
You might find it more intuitive to use the XQuery transform expression in the return clause of a FLWOR expression and to modify the documents that are selected by the for and where clauses with XML predicates (see Figure 12.35).

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
where $i/customerinfo/name = "Jim Noodle"
return
  copy $newinfo := $i
  modify do replace value of $newinfo/customerinfo/addr/pcode-zip
            with "XXX XXX"
  return $newinfo;

Figure 12.35 XQuery transform expression in the return clause of a FLWOR

In the next example (Figure 12.36), the goal is to obtain the address for Matt Foreman, rename the pcode-zip element to postalcode, and embed the modified address in a new element called sendto. This example shows how it can sometimes be useful to have non-trivial expressions in the copy and the return clause of the transform expression. The copy clause navigates to the addr element because only that part of the document is to be modified and returned. The return clause constructs a new element around the modified address.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/name = "Matt Foreman"
return
  copy $newinfo := $i/addr
  modify do rename $newinfo/pcode-zip as "postalcode"
  return <sendto>{$newinfo}</sendto> ;

<sendto>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <postalcode>M3Z 5H9</postalcode>
  </addr>
</sendto>

1 record(s) selected.

Figure 12.36 Returning modified parts of an XML document

Figure 12.37 shows a particularly interesting example of inserting XML elements from one document into another. The query reads the document for customer 1002 from the customer table and applies two changes to it. The first update expression inserts the partid elements of all the
items that the customer has ever ordered, and the second update expression removes the customer’s phone numbers from the document. The partid elements are obtained with a subquery from the purchaseorder table. The relational column cid of the customer table is passed as a variable ($CID) into the embedded SQL query, which selects the order documents for the given customer. The part IDs of all items (//item/partid) are extracted from each order document and inserted at the end of the customer document. In this case the insert expression inserts a sequence of elements, not just a single element.

SELECT XMLQUERY('
  copy $new := $INFO
  modify (
    do insert db2-fn:sqlquery("SELECT porder FROM purchaseorder
                               WHERE custid=parameter(1)", $CID)//item/partid
       as last into $new/customerinfo,
    do delete $new/customerinfo/phone)
  return $new')
FROM customer
WHERE cid = 1002;

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <partid>100-100-01</partid>
  <partid>100-103-01</partid>
  <partid>100-100-01</partid>
  <partid>100-100-01</partid>
  <partid>100-101-01</partid>
  <partid>100-201-01</partid>
</customerinfo>

1 record(s) selected.

Figure 12.37 Inserting XML elements from one document into another

12.11 MODIFYING XML DOCUMENTS IN INSERT OPERATIONS

In addition to performing update operations when you select or update an XML document, you can also apply changes to a document when you insert it into a table. As an example, assume an application receives XML documents that contain phone numbers, but the application is not supposed to store these phone numbers in a shared database. You can write an INSERT statement that takes an XML document as input, deletes the phone elements, and inserts the modified
document into a table. Such an INSERT statement is shown in Figure 12.38. The target table is called cust2 and is created as follows:

CREATE TABLE cust2 (cid INT, info XML)

The INSERT statement extracts the Cid attribute value from each input document and inserts it into the relational cid column of the table. It also inserts the document without phone elements into the XML column info. Note the use of the single dot on the right side of the copy expression. The dot refers to the full XML document provided by the row-generating expression $i.

INSERT INTO cust2(cid, info)
SELECT x.custid, x.xdoc
FROM XMLTABLE('$i' PASSING CAST(? AS XML) AS "i"
  COLUMNS
    custid INTEGER PATH 'customerinfo/@Cid',
    xdoc   XML     PATH 'copy $newinfo := .
                         modify do delete $newinfo/customerinfo/phone
                         return $newinfo' ) AS x

Figure 12.38 Deleting elements upon insertion

12.12 MODIFYING XML DOCUMENTS IN UPDATE CURSORS

As you would normally do for relational columns, you can declare a cursor FOR UPDATE of an XML column and then issue an UPDATE statement with the clause WHERE CURRENT OF cursorname. You can use such update cursors in your application programs or in stored procedures. Figure 12.39 shows sample code for a stored procedure that uses an update cursor to modify XML data.

(...)
DECLARE doc XML;
DECLARE c1 CURSOR FOR
  SELECT info FROM customer FOR UPDATE OF info;
OPEN c1;
FETCH c1 INTO doc;
WHILE SQLCODE <> 100 DO
  -- (some processing logic here)
  IF (condition) THEN
    UPDATE customer
    SET info = XMLQUERY('copy $new := $INFO
                         modify do delete $new/customerinfo/phone
                         return $new')
    WHERE CURRENT OF c1;
  END IF;
  FETCH c1 INTO doc;
END WHILE;
CLOSE c1;

Figure 12.39 Updating XML documents in a cursor

12.13 XML UPDATES IN DB2 FOR Z/OS

DB2 9 for z/OS does not currently support XML modifications other than full-document replacement. Enhanced XML update support is intended for a future release. For now, this leaves you with two options whenever you want to update a piece of a document. One option is to read the entire document into the application, modify it there, and replace the full document. The other option is to perform XML construction with the SQL/XML publishing functions to mimic an XML update. This means that you reconstruct the original document, reusing existing pieces that don’t change and constructing new elements where needed.

For example, assume you want to replace all existing phone elements in a customer document with a new phone element. The UPDATE statement in Figure 12.40 achieves exactly that. It constructs the updated document in the following way. The first XMLELEMENT function produces the root element customerinfo. This element contains the XML data produced by the subsequent XMLQUERY and XMLELEMENT functions. The XMLQUERY functions copy unchanged pieces of the original document into the new document. They include the Cid attribute, the name element, and the entire addr element with all its children. The existing phone element is not copied. In its place a new phone element with a new value is constructed.
UPDATE customer
SET info =
  XMLELEMENT(name "customerinfo",
    XMLQUERY('$i/customerinfo/@Cid' PASSING info AS "i"),
    XMLQUERY('$i/customerinfo/name' PASSING info AS "i"),
    XMLQUERY('$i/customerinfo/addr' PASSING info AS "i"),
    XMLELEMENT(name "phone",
      XMLATTRIBUTES('home' as "type"), '408-463-4963') )
WHERE cid = 1000

Figure 12.40 Reconstructing an existing document with a new phone element

Before you execute the UPDATE statement in Figure 12.40 you can run a query that constructs the same document (see Figure 12.41). This enables you to verify that the document construction achieves the desired result.
SELECT
  XMLELEMENT(name "customerinfo",
    XMLQUERY('$i/customerinfo/@Cid' PASSING info AS "i"),
    XMLQUERY('$i/customerinfo/name' PASSING info AS "i"),
    XMLQUERY('$i/customerinfo/addr' PASSING info AS "i"),
    XMLELEMENT(name "phone",
      XMLATTRIBUTES('home' as "type"), '408-463-4963') )
FROM customer
WHERE cid = 1000

Figure 12.41 Selecting an existing document with a new phone element

12.14 TRANSFORMING XML DOCUMENTS WITH XSLT

Let’s begin by reviewing some terms. XSL stands for eXtensible Stylesheet Language. XSLT stands for XSL Transformations, which is a subset of XSL focusing on document transformations. An XSLT style sheet contains the instructions to transform an existing XML document into a different format. The output of an XSL transformation can be a new XML document that has a different structure than the input document. The output can also be HTML or some non-XML format such as a flat file. These options are illustrated in Figure 12.42.

Since version 9.5, DB2 for Linux, UNIX, and Windows supports the XSLTRANSFORM function to perform XSL transformations in SQL statements. The input to an XSL transformation consists of an XML document that you want to transform and the XSLT style sheet that defines the transformation. XSLT Version 1.0 is supported. Teaching XSLT is outside the scope of this book, but the examples in this section are kept simple so that you can follow along without deep XSLT knowledge.

Input document:

<dept bldg="101">
  <employee id="901">
    <name>John Doe</name>
    <phone>408 555 1212</phone>
    <office>344</office>
  </employee>
</dept>

Outputs produced by three different XSLT style sheets (Style Sheets 1, 2, and 3 in the figure): a restructured XML document, a flat text record, and HTML.

<emp name="John Doe">
  <empNo>901</empNo>
  <contact>
    <phone>408 555 1212</phone>
    <room>344</room>
  </contact>
</emp>

John Doe;901;408 555 1212;344

Figure 12.42 Transforming XML

When should you use XSLT and when should you use XQuery Updates to modify an XML document?
XQuery Updates typically perform better than XSL transformations because they do not
incur any XML parsing costs of the target document. XQuery Updates are also particularly well suited for transactional updates that modify small or moderate portions of a document (see Table 12.3). XSLT can have advantages if you need to convert XML documents into drastically different formats, including HTML. Also, XSLT has been around for much longer than XQuery Updates, so you might have existing XSLT style sheets that you want to use with the XML data in DB2.

Table 12.3 When to Use XSLT or XQuery

                                                             XQuery Update   XSLT
Change, insert, delete specific elements/attributes
  (“point updates”)                                                +           –
High-performance database transactions                             +           –
Produce custom XML formats for specific consumers                  –           +
Format how XML data is rendered in a browser                       –           +

If you decide to use XSLT, another consideration is whether to perform the XSLT processing in the database layer (using DB2’s XSLTRANSFORM function), the mid-tier, or the application layer. A big factor in this decision is where you want to incur the CPU consumption. XSLT processing tends to be CPU intensive and the CPU cycles in the mid-tier or application layer may be less expensive than the CPU cycles on the database server. In this case you may want to avoid XSLT in the database server. On the other hand, performing XSLT transformations as part of a database query over XML data can be very convenient. It avoids additional logic in the consuming applications and serves XML data directly in the format that a particular application requires.

12.14.1 The XSLTRANSFORM Function

In its most simple form the XSLTRANSFORM function has the following syntax:

XSLTRANSFORM(xmldocument USING xslstylesheet)

The first parameter, xmldocument, provides a well-formed XML document as data type XML, CHAR, VARCHAR, CLOB, or BLOB. This is the document that is transformed using the XSL style sheet specified in the second parameter, xslstylesheet.
The style sheet can also be of type XML, CHAR, VARCHAR, CLOB, or BLOB and must represent a valid XSLT 1.0 style sheet. Let’s look at the examples in Figure 12.43. Both queries retrieve an XML document from the info column of the customer table and apply a style sheet that is provided via a parameter marker. SQL requires that the parameter marker is cast to an appropriate target type. In the first query the style sheet is passed as a VARCHAR(1000) value, in the second query as an XML value. The result type of the XSLTRANSFORM function is CLOB(2G). The result type is not XML because there is no guarantee that the output of the transformation is an XML document.
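To make the two-parameter form concrete, here is a minimal sketch that supplies the style sheet as a character string literal instead of a parameter marker. It rests on the assumption that a string literal is acceptable wherever a VARCHAR style sheet is, and the tiny style sheet itself is made up for illustration: it emits the Cid attribute and the name element as a semicolon-separated text record.

```sql
-- Sketch only: the style sheet is passed inline as a string literal
-- (assumed to be treated like a VARCHAR value). The style sheet is
-- illustrative and produces plain text rather than XML.
SELECT XSLTRANSFORM(info USING
  '<xsl:stylesheet version="1.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="text"/>
     <xsl:template match="/customerinfo">
       <xsl:value-of select="@Cid"/>;<xsl:value-of select="name"/>
     </xsl:template>
   </xsl:stylesheet>')
FROM customer
WHERE cid = 1000;
```

Because the output here is not XML at all, the example also illustrates why the function returns a character type rather than type XML.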
SELECT XSLTRANSFORM(info USING CAST(? AS VARCHAR(1000)) )
FROM customer
WHERE cid = 1000;

SELECT XSLTRANSFORM(info USING CAST(? AS XML) )
FROM customer
WHERE cid = 1000;

Figure 12.43 SQL queries with XSL transformation

Figure 12.44 shows that you can optionally specify a different result type, such as VARCHAR, CHAR, or BLOB. In this example, the style sheet is passed into the XSLTRANSFORM function as VARCHAR(1000), and the output of the transformation is of type VARCHAR(32000). You cannot specify type XML as the output type.

SELECT XSLTRANSFORM(info USING CAST(? AS VARCHAR(1000))
                    AS VARCHAR(32000) )
FROM customer
WHERE cid = 1000

Figure 12.44 SQL query with XSL transformation and custom result type

If the XSL transformation produces a well-formed XML document and you have a strong reason to return it to the application in a column of type XML, wrap the function XMLPARSE around the XSLTRANSFORM function in the SELECT list. However, you typically can and should avoid the XMLPARSE function because it introduces additional XML parsing overhead.

All the examples so far apply the XSL transformation as part of a query but do not modify the original document in the table. You can use the XSL transformation function in an UPDATE statement to replace a document with a transformation of itself, as shown in Figure 12.45. Here the style sheet is passed as a VARCHAR(5000) value. The UPDATE statement performs implicit XML parsing; that is, the character result of XSLTRANSFORM (CLOB(2G) by default) is automatically parsed to produce the required data type XML for the info column. The update fails if XSLTRANSFORM doesn’t produce a well-formed XML document.

UPDATE customer
SET info = XSLTRANSFORM(info USING CAST(? AS VARCHAR(5000)) )
WHERE cid = 1000

Figure 12.45 SQL UPDATE with XSL transformation

Optionally, the XSLTRANSFORM function can also accept a third parameter that provides an XML document containing parameter values for the style sheet, which allows for more flexibility.
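The third parameter is not shown in the figures, so the following is only a sketch of how such a call might look. It assumes that the parameter document is supplied in a WITH clause between the style sheet and the optional result type, and that the parameter document uses a params/param structure as outlined in the comment. Check the XSLTRANSFORM reference for your DB2 version before relying on either assumption. The parameter name headline is hypothetical and would have to match an xsl:param declared in the style sheet.

```sql
-- Sketch only. Assumptions, to be verified against the DB2 documentation:
--   * the parameter document is passed in a WITH clause, and
--   * it looks similar to
--       <params xmlns="http://www.ibm.com/XSLTransformParameters">
--         <param name="headline" value="Customer report"/>
--       </params>
--     where "headline" is a hypothetical xsl:param in the style sheet.
SELECT XSLTRANSFORM(info
         USING CAST(? AS VARCHAR(3000))   -- the XSLT style sheet
         WITH  CAST(? AS VARCHAR(1000))   -- the parameter document
         AS CLOB(1M))
FROM customer
WHERE cid = 1000;
```

Passing values this way lets one stored style sheet serve several callers that differ only in a few settings.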
It is further possible to apply the XSLTRANSFORM function to just a portion of an XML document. For example, assume you have a style sheet that transforms the address of a customer document. The query in Figure 12.46 uses the XMLQUERY function to extract just the addr branch of the document and provides this as input to the XSLTRANSFORM function. The addr element is then transformed by a style sheet that is provided via a parameter marker of type VARCHAR(3000).

SELECT XSLTRANSFORM( XMLQUERY('$INFO/customerinfo/addr')
                     USING CAST(? AS VARCHAR(3000)) )
FROM customer
WHERE cid = 1000

Figure 12.46 SQL/XML query with XSL transformation of a document fragment

If you frequently need to perform XSL transformations, you might want to store the XSL style sheets in a DB2 table instead of supplying them via a parameter marker. For example, you could create the table in Figure 12.47 where each row contains an XSL style sheet (xsldoc) and an INTEGER number (xslid) that serves as a style sheet identifier.

CREATE TABLE xslfile(xslid  INTEGER PRIMARY KEY NOT NULL,
                     xsldoc CLOB(1M) )

Figure 12.47 Defining a table for XSLT style sheets

Such a table allows you to pull specific style sheets into an invocation of the XSLTRANSFORM function, as shown in Figure 12.48. Remember that the style sheet can be of type XML, CHAR, VARCHAR, CLOB, or BLOB.

SELECT XSLTRANSFORM(info USING
         (SELECT xsldoc FROM xslfile WHERE xslid = 2) )
FROM customer
WHERE cid = 1004

Figure 12.48 Using an XSL style sheet from a DB2 table

There can be situations where different documents require transformation with different style sheets. For example, different documents might belong to different versions of an XML Schema or be consumed by different applications. You can add an INTEGER column xslid to the table that contains your XML documents and use it to indicate which style sheet is appropriate for any particular document.
Then you can perform a join, as in Figure 12.49, to transform multiple documents against their respective style sheets.
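The preparation step for such a join can be sketched as follows. The statements assume the customer table from this chapter and the xslfile table from Figure 12.47. The assignment of style sheet identifiers to customers is purely hypothetical.

```sql
-- Sketch only: add a style sheet identifier to the table that holds
-- the XML documents.
ALTER TABLE customer ADD COLUMN xslid INTEGER;

-- Hypothetical assignment: customer 1000 uses style sheet 1,
-- all other customers use style sheet 2.
UPDATE customer SET xslid = 1 WHERE cid = 1000;
UPDATE customer SET xslid = 2 WHERE cid <> 1000;
```

Rows whose xslid has no match in xslfile simply drop out of the inner join, so documents without an assigned style sheet are not transformed.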
SELECT XSLTRANSFORM (info USING xsldoc)
FROM customer c, xslfile x
WHERE c.xslid = x.xslid

Figure 12.49 Joining style sheets and XML documents

12.14.2 XML to HTML Transformation

Assume you want to read the contents of an XML document in the info column of the customer table and produce the Cid, name, and street information in HTML format. Remember that Cid is an attribute of the root element and that name and street are elements. The INSERT statement in Figure 12.50 inserts a suitable XSL style sheet into the table xslfile.

INSERT INTO xslfile VALUES (1,
'<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h2>The customers</h2>
        <table border="1">
          <tr bgcolor="#FFF00">
            <th align="left">Cid</th>
            <th align="left">Name</th>
            <th align="left">Street</th>
          </tr>
          <xsl:for-each select="customerinfo">
            <tr>
              <td><xsl:value-of select="@Cid"/></td>
              <td><xsl:value-of select="name"/></td>
              <td><xsl:value-of select="addr/street"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>')

Figure 12.50 INSERT statement for the XSL transformation style sheet

Note that the XSL style sheet contains the required namespace declaration for XSLT, xmlns:xsl="http://www.w3.org/1999/XSL/Transform". If your XML data contains namespaces, then they should be declared after the XSL function namespace.

After you have inserted your XSL style sheet you can select from the customer table using the XSLTRANSFORM function, as in Figure 12.51. It produces an HTML page for the selected customer. To produce an HTML page with information for multiple customers, this information
needs to be aggregated in a single XML document (XMLAGG) and passed to the XSLTRANSFORM function.

SELECT XSLTRANSFORM(info USING
         (SELECT xsldoc FROM xslfile WHERE xslid = 1)
         AS CLOB(1M))
FROM customer
WHERE cid = 1000;

<html>
  <body>
    <h2>The customers</h2>
    <table border="1">
      <tr bgcolor="#FFF00">
        <th align="left">Cid</th>
        <th align="left">Name</th>
        <th align="left">Street</th>
      </tr>
      <tr>
        <td>1000</td>
        <td>Kathy Smith</td>
        <td>5 Rosewood</td>
      </tr>
    </table>
  </body>
</html>

1 record(s) selected.

Figure 12.51 Transforming XML to HTML in an SQL/XML query

If you copy the output from the query in Figure 12.51 into a file then you can use a web browser to view the rendered HTML page, as illustrated in Figure 12.52.

Figure 12.52 HTML output rendered in a browser

Note that the HTML page produced in Figure 12.51 is quite static and hardcoded in the XSLT style sheet. Only three values are dynamically extracted from the XML document and plugged into the HTML code. The same HTML code can be produced with XML construction in an XMLQUERY function in a SELECT statement (see Figure 12.53). This is more efficient than using XSLT and consumes fewer CPU cycles. It’s an indication that XQuery, with its construction and update capabilities, often allows you to avoid or replace XSLT processing and gain higher performance at the same time.
SELECT XMLQUERY('
  <html>
    <body>
      <h2>The customers</h2>
      <table border="1">
        <tr bgcolor="#FFF00">
          <th align="left">Cid</th>
          <th align="left">Name</th>
          <th align="left">Street</th>
        </tr>
        <tr>
          <td>{$INFO/customerinfo/data(@Cid)}</td>
          <td>{$INFO/customerinfo/name/text()}</td>
          <td>{$INFO/customerinfo/addr/street/text()}</td>
        </tr>
      </table>
    </body>
  </html> ')
FROM customer
WHERE cid = 1000

Figure 12.53 Producing HTML using direct XML constructors

NOTE
Sometimes XQuery with construction or update expressions can achieve the same result as an XSLT style sheet. In that case it is recommended to use XQuery (possibly embedded in SQL) rather than XSLT, because XQuery typically provides better performance than XSLT.

12.15 SUMMARY

DB2 supports several methods to update or transform XML documents. A simple method is full-document replacement, which allows an application to replace an existing XML document with an updated document. An application can read an XML document from the database, parse and modify it with custom application code, and then use it in an UPDATE statement to replace the original version of the document.

In DB2 for Linux, UNIX, and Windows, the XQuery Update Facility also enables you to modify, insert, or delete individual elements and attributes within an XML document without reading the document into your application. This method often provides better performance than full-document replacement. XQuery Updates provide a standardized and declarative way to express XML document modifications, which is more efficient and less error-prone than procedural custom code.

XQuery Updates are most commonly embedded in SQL UPDATE statements to change existing documents in the database. They can also be used in queries or SQL INSERT statements
to modify documents on their way into or out of the database. A single statement can contain a list of multiple XQuery Update expressions to apply multiple modifications to a document. Keep in mind that each individual XQuery Update expression modifies exactly one element or attribute at a time, but you can iterate with a for clause to handle missing or repeating elements. You can also use predicates and if-then-else expressions to code conditional updates or “upsert” operations.

DB2 for Linux, UNIX, and Windows also supports XSL transformations that can transform an XML document into a different XML format, HTML, or into a non-XML format. Optionally, XSLT style sheets may contain parameters to make transformations more flexible and dynamic. You can apply XSL transformations to full documents in queries, updates, and inserts, or to partial documents produced by an SQL/XML query.
Chapter 13 Defining and Using XML Indexes

This chapter looks at defining and using XML indexes to improve performance. The main reasons why you want to define indexes on XML data are the same as for relational data; that is, to evaluate predicates efficiently and to avoid table scans. Just as you define relational indexes on selected columns of a relational table, you define XML indexes on selected elements and attributes within a single XML column of a table. In particular, XML indexes in DB2 do not automatically index all the values in an XML column, but only the ones that you choose. Although you can choose to index all elements and attributes, you should typically index just those elements and attributes that are frequently used in predicates and join conditions.

XML and relational indexes also have other similarities. Both are physically implemented as B-Tree structures. Whenever XML documents are inserted, updated, or deleted, the affected XML indexes are immediately updated before the transaction commits. This behavior is known as synchronous index maintenance. DB2 can also collect statistics for XML indexes and use them to generate efficient access plans (also known as execution plans).

In this chapter we explain
• How to define XML indexes and their data types (sections 13.1 and 13.2)
• The usage of XML indexes to improve query performance (sections 13.3 through 13.5)
• Cases in which XML indexes cannot be used to evaluate query predicates (section 13.6)
• An inside look at XML index internals and statistics (sections 13.7 and 13.8)
13.1 DEFINING XML INDEXES

To illustrate XML indexes in this chapter, we use the following table:

CREATE TABLE books (id INT, bookinfo XML)

The XML column bookinfo contains the XML documents shown in Figure 13.1. Each document has a root element called book, and under that are three other elements: title, price, and authors. Some of the documents have additional elements under book, such as the publication date pubdate. Each authors element has one or more author child elements.

<book id="101" isbn="0-321-18060-7">
  <title>International Pasta</title>
  <price>31</price>
  <pubdate>2005-02-14</pubdate>
  <authors>
    <author id="2000">Frank Peterson</author>
    <author id="2001">Mary Smith</author>
  </authors>
</book>

<book id="102" isbn="0-596-00252-1">
  <title>Bedtime Stories</title>
  <price>£32</price>
  <authors>
    <author id="2001">Mary Smith</author>
  </authors>
</book>

<book id="103" isbn="1-59059-983-7">
  <title>The Moon and I</title>
  <price>33</price>
  <authors>
    <author id="2002">Tom Noodle</author>
  </authors>
</book>

Figure 13.1 Sample documents in the books table

The first example of an XML index is the CREATE INDEX statement in Figure 13.2. As for relational indexes, you need to specify the table name (books) and the column (bookinfo) that you want to index. At most one column can be specified in an XML index. The clause GENERATE KEYS USING XMLPATTERN defines a path expression that identifies the elements or attributes whose values you want to index. In this case, the title elements of all books are indexed. Finally, an SQL data type indicates how the indexed values should be represented in the index. In this index, the book titles are represented as VARCHAR(50) values.

CREATE INDEX idx1 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/title' AS SQL VARCHAR(50)

Figure 13.2 Simple example of creating an XML index
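To connect the index definition to the kind of predicate it is meant to support, here is a sketch of a query that filters on the indexed path /book/title with a string comparison. It assumes the books table and the sample documents of Figure 13.1. The variable name b is arbitrary. Whether DB2 actually chooses idx1 for this query depends on the eligibility rules and optimizer decisions discussed in sections 13.3 through 13.6.

```sql
-- Sketch only: a predicate on /book/title, the path indexed by idx1,
-- compared against a string to match the VARCHAR index data type.
SELECT id
FROM books
WHERE XMLEXISTS('$b/book[title = "Bedtime Stories"]'
                PASSING bookinfo AS "b");
```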
The specification of a data type is a key difference from relational indexes. When you define a relational index on a relational column, the data type of the column automatically defines the data type of the index keys. When you index XML data, such as the title elements in this example, the appropriate data type for the index keys is not known to DB2. This is because DB2 does not force you to use a single fixed XML Schema for all documents in the XML column. The use of an XML Schema is optional, and different documents in one column can potentially use different XML Schemas. XML index data types are discussed in more detail in section 13.2.

As another example, consider the two indexes in Figure 13.3. The first one indexes the author elements. Note that there can be multiple author elements per document. Hence, this index can contain multiple index entries that point to the same row and document. As a result, the cardinality of the index can be larger than the cardinality of the table.

The second index in Figure 13.3 indexes the pubdate element, which exists in only one of the three sample documents. This index contains index entries only for those rows where the pubdate element exists. Therefore, when you index an optional element that occurs only in a subset of the documents, the cardinality of the index can be less than the cardinality of the table. As a result, the size of indexes on optional elements can be quite small, depending on how frequently the element occurs in the XML column.

CREATE INDEX idx2 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/authors/author' AS SQL VARCHAR(50)

CREATE INDEX idx3 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/pubdate' AS SQL DATE

Figure 13.3 XML indexes with a variable number of keys per row

In general, an XML index can contain zero, one, or multiple index entries per row (document).
This is different from relational indexes, which contain exactly one index entry for each row.

The XMLPATTERN in the index definition can contain XPath expressions and wildcards (* and //) and can point to attributes or elements. XPath predicates or parent steps are not allowed in the XMLPATTERN. If your XML data contains namespaces, then this also needs to be reflected in the XMLPATTERN. Indexes with namespaces are explained in Chapter 15, Managing XML Data with Namespaces.

The index in Figure 13.4 uses the XMLPATTERN //@id. For the sample data in Figure 13.1, this pattern matches the book id attributes on the path /book/@id as well as the author id attributes on the path /book/authors/author/@id. Hence, this index contains multiple entries per row (document).

CREATE INDEX idx3 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '//@id' AS SQL VARCHAR(10)

Figure 13.4 Indexing nodes at multiple paths
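To make the variable number of index entries per row concrete, the following Python sketch counts how many nodes the patterns from Figures 13.3 and 13.4 match in each sample document of Figure 13.1. It is only a model and not DB2 code: the helper names keys_for_path and keys_for_any_id are hypothetical, and the standard-library module xml.etree.ElementTree stands in for the XMLPATTERN evaluation.

```python
import xml.etree.ElementTree as ET

# The three sample documents from Figure 13.1, one per table row.
DOCS = [
    """<book id="101" isbn="0-321-18060-7">
         <title>International Pasta</title><price>31</price>
         <pubdate>2005-02-14</pubdate>
         <authors><author id="2000">Frank Peterson</author>
                  <author id="2001">Mary Smith</author></authors>
       </book>""",
    """<book id="102" isbn="0-596-00252-1">
         <title>Bedtime Stories</title><price>\u00a332</price>
         <authors><author id="2001">Mary Smith</author></authors>
       </book>""",
    """<book id="103" isbn="1-59059-983-7">
         <title>The Moon and I</title><price>33</price>
         <authors><author id="2002">Tom Noodle</author></authors>
       </book>""",
]

def keys_for_path(doc, path):
    """Values of the elements on a fully specified path such as '/book/title'."""
    root = ET.fromstring(doc)
    first, _, rest = path.lstrip("/").partition("/")
    if root.tag != first:
        return []
    nodes = root.findall(rest) if rest else [root]
    return ["".join(n.itertext()) for n in nodes]

def keys_for_any_id(doc):
    """Values matched by '//@id': id attributes at any level of the document."""
    return [e.get("id") for e in ET.fromstring(doc).iter() if e.get("id") is not None]

author_keys = [len(keys_for_path(d, "/book/authors/author")) for d in DOCS]
pubdate_keys = [len(keys_for_path(d, "/book/pubdate")) for d in DOCS]
id_keys = [len(keys_for_any_id(d)) for d in DOCS]

print(author_keys)   # [2, 1, 1]  four entries for three rows
print(pubdate_keys)  # [1, 0, 0]  one entry for three rows
print(id_keys)       # [3, 2, 2]  book id plus author ids
```

In this model the author pattern yields more entries than rows, the pubdate pattern yields fewer, and //@id picks up attributes from two different paths.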
Figure 13.5 shows the most relevant parts of the CREATE INDEX statement syntax for XML indexes in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Some details are omitted for simplicity, such as namespaces and the rarely needed capabilities to index XML comments and processing instructions. Note that DB2 for z/OS supports two data types for XML indexes, VARCHAR(n) and DECFLOAT, whereas DB2 for Linux, UNIX, and Windows supports all data types shown in Figure 13.5, except DECFLOAT. XML index data types and the handling of invalid values are discussed in section 13.2. Unique indexes are covered in section 13.1.1.

>>-CREATE--+--------+--INDEX--index-name----------------------->
           '-UNIQUE-'

>--ON table-name (xml-column-name)----------------------------->

>--GENERATE KEYS USING XMLPATTERN------------------------------>

      .--------------------------------------------------.
      ↓                                                  |
>--'----+-/--+--+-----+----+-| xml element name |---+----+--'-->
        '-//-'  '--@--'    +-| xml attribute name |-+
                           +-- text() --------------+
                           '-- * -------------------'

>--AS-SQL--+-VARCHAR--+-(--integer--)-+-+---------------------->
           |          '-HASHED--------' |
           +-DOUBLE---------------------+
           +-DECFLOAT-------------------+
           +-DATE-----------------------+
           '-TIMESTAMP------------------'

       .-IGNORE INVALID VALUES-.
>------+-----------------------+--------------------|
       '-REJECT INVALID VALUES-'

Figure 13.5 Syntax diagram for the XML clauses of the CREATE INDEX statement

13.1.1 Unique XML Indexes

You can use the UNIQUE keyword in the CREATE INDEX statement to enforce uniqueness across and within all XML documents stored in a single XML column. The uniqueness of a node is enforced using the index data type, the XML path to the node, and the value of the node after being cast to the index data type. To enforce uniqueness of all author id attributes you can define the index in Figure 13.6.
CREATE UNIQUE INDEX authorIdx ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '/book/authors/author/@id' AS SQL DOUBLE

Figure 13.6 Creating a unique index
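The uniqueness rule can be sketched in Python as follows. The function find_duplicates is a hypothetical helper and not part of DB2; it mimics the stated rule that keys are compared after being cast to the index data type, across all documents as well as within each one. Python's float is used here as a rough stand-in for DOUBLE, and the author ids are copied from Figure 13.1.

```python
from collections import Counter

# Author id attribute values per document, as in Figure 13.1.
AUTHOR_IDS = {101: ["2000", "2001"], 102: ["2001"], 103: ["2002"]}

def find_duplicates(ids_by_doc, cast=float):
    """Return the cast key values that occur more than once in the whole column."""
    counts = Counter(cast(v) for ids in ids_by_doc.values() for v in ids)
    return sorted(k for k, c in counts.items() if c > 1)

duplicates = find_duplicates(AUTHOR_IDS)
print(duplicates)  # [2001.0]

# Keys are compared after the cast, so "7" and "7.0" collide as numeric keys
# even though they differ as strings.
numeric_clash = find_duplicates({1: ["7"], 2: ["7.0"]})
string_clash = find_duplicates({1: ["7"], 2: ["7.0"]}, cast=str)
print(numeric_clash)  # [7.0]
print(string_clash)   # []
```

In this model the sample data contains a duplicate, because author id 2001 appears in two documents. A unique index on this path therefore suits data in which each author id is meant to occur only once in the entire column.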
The XMLPATTERN in a unique XML index cannot contain a wildcard (*) or a double slash (//). The reason for this restriction is that enforcing uniqueness in an index with wildcard or double slash can require probing the index multiple times for each value that is inserted. To avoid this overhead, unique indexes have to be specified with fully qualified paths.

13.1.2 Lean XML Indexes

A lean index is one that defines the path to just the element or attribute that you want to index, without including any additional elements or attributes that do not need to be indexed. A lean index always requires a fully qualified path; that is, a path without a wildcard (*) or double slash (//). As an example, suppose you often search for books and authors via the author id attribute. Then you can define the index in Figure 13.7 where the path to the author id attributes has been fully specified. This index is lean because no other nodes are included in the index than the ones you intended to index.

CREATE INDEX idx1 ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '/book/authors/author/@id' AS SQL VARCHAR(30)

Figure 13.7 A lean index on author id attributes

Alternatively, you could decide to define the index in Figure 13.8, which uses the double slash (//) to index id attributes anywhere in the document. This index is “heavier” than the previous one because it contains index entries for the id attributes of the author and the book elements. If you never search for books via their id attributes then including them in the index is wasted overhead. The size of the index as well as the cost to maintain it during insert, update, and delete operations is larger than necessary. For example, when a new document is inserted, it is more efficient to navigate straight to /book/authors/author/@id to obtain the index keys than to traverse the entire document tree to evaluate //@id.
CREATE INDEX idx2 ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '//@id' AS SQL VARCHAR(30)

Figure 13.8 A “heavy” index on all id attributes

If you need to support queries that search by author id as well as queries that search by book id, it is recommended to define two indexes, one on /book/authors/author/@id and one on /book/@id. This approach provides better performance than a single index on //@id. When defining indexes you should always try to use fully specified path expressions and avoid using wildcards (*) or double slashes (//).
13.1.3 Using the DB2 Control Center to Create XML Indexes

The DB2 Control Center contains a Create Index Wizard to conveniently create indexes on elements or attributes in an XML column. In the Control Center, right-click on the folder Indexes or on a table name and select Create and then Index to bring up the Create Index Wizard shown in Figure 13.9. In the wizard, first specify the schema and name of the table. Then click the radio buttons to indicate that you want to index an XML column and whether you want to ignore or reject invalid XML values in the index. The options Reject and Ignore are explained in section 13.2.6.

Figure 13.9 Creating an XML index using the DB2 Control Center

The wizard then lets you select an XML column before it takes you to the “3. XML Pattern” dialog shown in Figure 13.10. In the upper part of this dialog all existing XML indexes on the selected column are listed. The lower part shows the tree structure of a sample document from the XML column. In that tree, highlight the element or attribute that you want to index and click the Add Index button. You are then presented with options to index either just the selected node, which is the common and default case, or to index specific child or descendant nodes. When you click OK, the new index is added to the list of indexes above where you can still edit any of the index properties. You cannot alter any previously defined indexes in this dialog.
Figure 13.10 The XML Pattern screen in the XML Create Index wizard

One of the key benefits of the wizard is that it creates the XMLPATTERN for you and includes the appropriate declarations for any namespaces that might exist in the sample document from the XML column. In the final steps of the wizard you can review the CREATE INDEX DDL statement before executing it.

13.2 XML INDEX DATA TYPES

In DB2 for Linux, UNIX, and Windows there are five data types for XML indexes: VARCHAR(n), VARCHAR HASHED, DOUBLE, DATE, and TIMESTAMP. DB2 for z/OS supports two SQL data types for XML indexes: VARCHAR(n) and DECFLOAT. The XML index data types and their respective behaviors are discussed in this section.

13.2.1 VARCHAR(n)

The data type VARCHAR(n) is used to index string values that have a fixed maximum length. The length n is a hard constraint. This means
• If you try to insert an XML document in which the indexed elements or attributes have a value longer than n bytes, the INSERT statement fails and the document is not inserted.
• An UPDATE statement fails if it tries to assign a value longer than n bytes to the node that is indexed as VARCHAR(n).
• If you try to create a new index over an XML column that already contains XML documents, and if these documents contain nodes on the index path with values longer than n bytes, then the index is not created and the CREATE INDEX statement fails.

Note that the value n indicates bytes and not characters. If you are managing data in a code page where each character is represented by multiple bytes, you need to choose the value n accordingly. The minimum value for n is 1. The maximum value for n is 1000 in DB2 for z/OS. In DB2 for Linux, UNIX, and Windows the maximum value for n depends on the page size, as shown in Table 13.1.

Table 13.1 Maximum Key Length of a VARCHAR Index for a Given Page Size

Page size (KB)    Maximum value of n (in bytes)
4                 817
8                 1841
16                3889
32                7985

If you prefer to avoid the rejection of documents when they violate the specified VARCHAR length, you can safely specify the maximum allowed length for the index keys, such as VARCHAR(817) if your page size is 4KB. Using the maximum length does not waste any space, because in each index entry only the actual size of the indexed value is allocated and not the full 817 bytes.

13.2.2 VARCHAR HASHED

The data type VARCHAR HASHED allows you to index character strings of arbitrary length. The length of the indexed string has no limits. DB2 generates an eight-byte hash code over the entire string and uses the hash code as the index key. Hashed indexes are useful for string values that are longer than the values supported by VARCHAR(n) for the respective page size. Using the VARCHAR HASHED data type can also significantly reduce the space consumption of an index.
For example, if you index an element whose values are 100 bytes long, then in each entry a VARCHAR HASHED index uses 92 bytes less than a VARCHAR(100) index. The disadvantage of a hashed index is that it can only be used for equality predicates, but not for less-than and greater-than predicates. Figure 13.11 shows an example of an index with VARCHAR HASHED as the index type.
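The trade-off of a hashed index can be illustrated with a short Python sketch. The hash function DB2 uses is not described here, so the BLAKE2b digest from the standard hashlib module is purely an assumption, and hashed_key is a hypothetical helper. The sketch shows the 92-byte saving for a 100-byte value and why only equality predicates can be supported.

```python
import hashlib

def hashed_key(value: str) -> bytes:
    """Eight-byte stand-in for a VARCHAR HASHED key (BLAKE2b is an assumption)."""
    return hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()

long_value = "x" * 100                      # a value that is 100 bytes long
key = hashed_key(long_value)
saved = len(long_value.encode("utf-8")) - len(key)
print(saved)  # 92

# Equal strings always produce equal keys, so an equality predicate can
# probe the index with the hash code of the search value.
same = hashed_key("Bedtime Stories") == hashed_key("Bedtime Stories")
print(same)  # True

# The order of the hash codes says nothing reliable about the order of the
# strings, so less-than and greater-than predicates cannot be answered.
titles = ["Bedtime Stories", "International Pasta", "The Moon and I"]
by_hash = sorted(titles, key=hashed_key)
print(by_hash)  # some permutation of the titles
```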
CREATE INDEX idx1 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/title' AS SQL VARCHAR HASHED

Figure 13.11 Creating an index of type VARCHAR HASHED

13.2.3 DOUBLE and DECFLOAT

In DB2 for Linux, UNIX, and Windows, the data type DOUBLE is used to index any kind of numeric values. DB2 for z/OS uses the data type DECFLOAT instead of DOUBLE. If a document is inserted in which the indexed element or attribute has a value that cannot be cast to DOUBLE or DECFLOAT, such as any alphanumeric string, the default behavior is that the document is inserted but no index entry is added to the index. No warning or error is issued. This behavior is safe and never leads to incomplete query results. Due to its data type, a DOUBLE or DECFLOAT index is only used to evaluate numeric predicates where XML nodes with non-numeric values can never be a possible match. Hence, non-numeric values can safely be omitted from the index.

If you want to reject documents that contain non-numeric values in the indexed path, you can do so in DB2 for Linux, UNIX, and Windows with the REJECT INVALID VALUES clause of the CREATE INDEX statement (see section 13.2.6).

13.2.4 DATE and TIMESTAMP

Use the types DATE and TIMESTAMP to index XML nodes with date or timestamp values, respectively. The SQL data type DATE is used to index XML values of type xs:date. The SQL data type TIMESTAMP is used to index XML values of type xs:dateTime. Remember that xs:dateTime values have the form yyyy-mm-ddThh:mm:ss.nnnnnn, as in <arrival>2008-10-31T07:45:57.345332</arrival>. If you want to specify a time zone, then you add a “Z” to the timestamp to signify UTC, or you can specify an offset from the UTC time (GMT), for example <arrival>2008-10-31T07:45:57.345332+03:00</arrival>.

XML indexes of type DATE and TIMESTAMP behave just like DOUBLE indexes as far as invalid values are concerned.
If the value of an indexed element or attribute cannot be cast to the type of the index, the document is inserted but no index entry is generated. In DB2 for Linux, UNIX, and Windows you can choose to reject such documents if you use the REJECT INVALID VALUES clause in the CREATE INDEX statement (see section 13.2.6).

13.2.5 Choosing a Suitable Index Data Type

The data type of relational indexes is always determined by the type of the indexed column. However, since DB2 does not force you to associate an XML Schema with an XML column, the data types of XML elements or attributes are not predetermined. Thus, each XML index requires a target type, and the type matters. Assume a price element has the value 9. A string predicate "9" < "29" is false while a numeric comparison 9 < 29 is true. Similarly, the string predicate "100" = "1E2" is false while a numeric comparison 100 = 1E2 is true. The literal 1E2 is a valid value for the XML data type xs:double. Hence, you should use DOUBLE indexes (DECFLOAT in z/OS) if you want semantically correct numeric comparisons.

Similar considerations apply to dates and timestamps when time zones are involved. You can index date and timestamp values as VARCHAR(n), and query them with string comparisons, if they have no time zone indicators or if all values are in the same time zone. This is the recommended approach in DB2 for z/OS. In DB2 for Linux, UNIX, and Windows you should use the index data types DATE and TIMESTAMP instead. These data types ensure correct date and time comparison even across time zones.

When you define XML indexes you might find yourself confronted with the following questions:
• How do you know whether the values of a certain element can always be cast to DOUBLE, DATE, or TIMESTAMP?
• How do you know how large to make the VARCHAR(n) type?

The first and best place to look for the answer is the XML Schema associated with your XML documents. If you do not have an XML Schema or if you prefer to explore the actual XML documents in an XML column, you can use some of the following queries.

To check the actual maximum length of an element’s value across many documents, use one of the queries shown in Figure 13.12. Note that these queries can be expensive to run over a large number of documents, so you might want to add a WHERE clause to the subselect. Rather than scanning the entire table, look at a representative subset of the documents.
SELECT MAX(LENGTH(title))
FROM (SELECT XMLCAST(XMLQUERY('$BOOKINFO/book/title')
             AS VARCHAR(50) ) AS title
      FROM books);

SELECT MAX(len)
FROM (SELECT XMLCAST(XMLQUERY('$BOOKINFO/book/string-length(title)')
             AS INTEGER ) AS len
      FROM books);

Figure 13.12 Checking the maximum length of an element value
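When you derive n from observed lengths, keep in mind that n counts bytes and not characters. The Python sketch below is independent of DB2 and assumes UTF-8 encoding; utf8_len is a hypothetical helper, and the second title is a made-up value that does not appear in Figure 13.1. It shows how the two measures diverge for non-ASCII text.

```python
def utf8_len(value: str) -> int:
    """Length of a string in bytes when it is encoded as UTF-8."""
    return len(value.encode("utf-8"))

# The first title is from Figure 13.1; the second one is hypothetical.
titles = [
    "International Pasta",
    "Caf\u00e9 M\u00fcller: Stra\u00dfenk\u00fcche",
]

for t in titles:
    print(f"{t!r}: {len(t)} characters, {utf8_len(t)} bytes")

max_chars = max(len(t) for t in titles)
max_bytes = max(utf8_len(t) for t in titles)
print(max_chars, max_bytes)  # 25 29
```

Under these assumptions a VARCHAR(25) index would be too short for the second title even though the title has only 25 characters, which is why a byte-based measurement with some headroom is the safer basis for choosing n.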
Now let’s look at two queries that check whether the value of the price element is indeed numeric in all documents. The first query tries to cast the values of all price elements to DOUBLE:

SELECT COUNT( XMLCAST( XMLQUERY('$b/book/price' PASSING bookinfo AS "b") AS DOUBLE) )
FROM BOOKS

If all price elements are numeric, this query returns the count of all prices. Otherwise, if a non-numeric value is encountered, the query fails with the following message:

SQL16061N The value "£32" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "xs:double". Error QName=err:FORG0001. SQLSTATE=10608

The following XQuery is somewhat smarter because it does not fail but returns all the documents in which the price value is not castable to xs:double:

xquery
for $i in db2-fn:xmlcolumn("BOOKS.BOOKINFO")/book
where not($i/price castable as xs:double)
return $i

13.2.6 Rejecting Invalid Values

DB2 for Linux, UNIX, and Windows provides the option to reject invalid values in XML indexes of type DOUBLE, DATE, or TIMESTAMP. As an example, remember that each of the three sample documents in Figure 13.1 contains a price element. These three price elements are
• <price>31</price>
• <price>£32</price>
• <price>33</price>

Note that the first and third price elements have a numeric value while the second price element has a non-numeric value because it contains the character £. Assume you create an index for these price elements as type DOUBLE, shown in Figure 13.13. This index only contains entries for the first and third price elements, not for the second price element whose value cannot be cast to DOUBLE. Any price values that do not match the DOUBLE data type are silently omitted from the index, while the documents that contain them are still stored.

CREATE INDEX idx1 ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '/book/price' AS SQL DOUBLE

Figure 13.13 Indexing the price elements as DOUBLE
Omitting non-numeric values from the index does not lead to incomplete query results. If a query searches for books where the price is £32, then the literal value £32 in the predicate must be enclosed in double quotes because it is not a numeric value: /book[price="£32"]. The double quotes imply that "£32" is a string, not a number, which means that DB2 cannot use the DOUBLE index to evaluate the predicate anyway. Therefore, the absence of the value £32 from the index in Figure 13.13 does no harm and only saves space.

Depending on the nature of your application you might want to guarantee that if a certain element or attribute is indexed as DOUBLE, then all occurrences of that element or attribute are indeed of type DOUBLE. In other words, you might want to enforce the index data type as a hard constraint. Figure 13.14 shows how you can enforce the index data type with the REJECT INVALID VALUES clause. Due to the REJECT INVALID VALUES clause, this CREATE INDEX statement fails with error SQL20306N if the XML column contains one or more documents in which the price element contains a value that cannot be cast to DOUBLE. Similarly, if an application tries to insert a document where the price element has a non-numeric value, the INSERT statement fails with error SQL20305N, reason code 5.

CREATE INDEX priceIdx ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '/book/price' AS SQL DOUBLE
  REJECT INVALID VALUES

Figure 13.14 Rejecting invalid DOUBLE values

The same concepts apply to indexes of type DATE or TIMESTAMP. Figure 13.15 creates an index of type DATE on the pubdate element with the REJECT INVALID VALUES clause. This index definition guarantees that the XML column never contains documents in which the pubdate element contains a value that cannot be cast to type DATE.
CREATE INDEX idxpubdate ON books(bookinfo)
  GENERATE KEY USING XMLPATTERN '/book/pubdate' AS SQL DATE
  REJECT INVALID VALUES

Figure 13.15 Creating an index of type DATE

The REJECT INVALID VALUES clause does not affect which values are or aren’t included in the XML index. The contents and structure of the index are exactly the same with or without this clause. The REJECT INVALID VALUES clause only affects which documents are or aren’t allowed to be stored in the XML column. Such constraints, and more complex ones, can also be enforced with XML Schemas.
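The difference between the two clauses can be modeled in a few lines of Python. This is a sketch and not DB2 behavior in every detail: index_keys is a hypothetical helper, and Python's float is only an approximation of a cast to DOUBLE because it accepts a few spellings that xs:double does not.

```python
def index_keys(values, reject_invalid=False):
    """Cast each value to float; skip or reject the ones that cannot be cast."""
    keys = []
    for v in values:
        try:
            keys.append(float(v))
        except ValueError:
            if reject_invalid:
                # REJECT INVALID VALUES: the document is not stored at all.
                raise ValueError(f"document rejected: {v!r} is not numeric") from None
            # IGNORE INVALID VALUES: the document is stored without an index entry.
    return keys

prices = ["31", "\u00a332", "33"]   # price values from Figure 13.1

ignored = index_keys(prices)
print(ignored)  # [31.0, 33.0]

try:
    index_keys(prices, reject_invalid=True)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True

# For data that passes the check, both variants build the same index content.
same_content = index_keys(["31", "33"], True) == index_keys(["31", "33"])
print(same_content)  # True
```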
13.3 USING XML INDEXES TO EVALUATE QUERY PREDICATES

In this section we discuss the usage of XML indexes to improve query performance. Let’s start with an example. Suppose you want to write a query to return the authors of all books with a given title and price range. Figure 13.16 shows how this query can be coded.

SELECT XMLQUERY('$BOOKINFO/book/authors')
FROM books
WHERE XMLEXISTS('$BOOKINFO/book[title = "DB2 9 New Features" and price < 50]')

Figure 13.16 SQL/XML query with two predicates

Without any indexes, this query scans the entire bookinfo column, which is typically inefficient. To define suitable XML indexes, you need to know which paths to index and which data types to use for the indexes. The predicates in Figure 13.16 constrain the title and price elements, which are identified by the paths /book/title and /book/price, respectively. Assuming that book titles are always character values and prices always numeric values, these elements should be indexed as VARCHAR and DOUBLE (or DECFLOAT) values, respectively. Figure 13.17 shows the two corresponding index definitions. These indexes are eligible (can be used) to evaluate the predicates in Figure 13.16.

CREATE INDEX idx1 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/title' AS SQL VARCHAR(50)

CREATE INDEX idx2 ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/price' AS SQL DOUBLE

Figure 13.17 Creating a VARCHAR and a DOUBLE index for the query in Figure 13.16

13.3.1 Understanding Index Eligibility

Index eligibility deals with the question of whether a certain XML index can be used to answer a given query predicate. This question is typically trivial in relational query processing. Any index defined on a single relational column can be used to answer any equality or range predicate on that column. This problem, however, is more difficult for XML columns and XML indexes.
An index on a relational column stores all values that appear in the indexed column, but an XML index stores only values of nodes that match the XPath pattern and the data type in the index definition. Therefore, an XML index can be used (is eligible) to answer an XML query predicate only if the index contains all XML nodes that satisfy the query predicate. To ensure this, the following three index eligibility conditions must be met:
• The data type of the index and of the predicate must be compatible.
• Text nodes must be handled consistently in the index and the predicate.
• The index must “contain” the query predicate; that is, the XMLPATTERN of the index must be equally or less restrictive than the XPath in the predicate.

If these conditions are met for a given index and predicate, the DB2 optimizer is allowed to consider the index in the generation of the query execution plan. The optimizer then makes a cost-based decision whether to use the index or not in order to minimize the execution time of the query.

If you encounter a situation where an index is not used that you think should be used, begin your investigation by examining the index definition and the predicate to check whether the three index eligibility conditions are met. To help you do this, we explain the three conditions with examples in the following sections 13.3.2, 13.3.3, and 13.3.4, respectively.

13.3.2 Data Types in XML Indexes and Query Predicates

As we explained in section 13.2, the data type of an XML index matters because it determines the type of comparisons that the index can support. For example, indexes of type VARCHAR support general string comparisons but no numeric comparisons. Indexes of type DOUBLE or DECFLOAT support numeric comparisons but no string comparisons, and so on. Remember that comparing strings is different from comparing numbers. A string predicate such as "2" < "100" is false while the numeric comparison 2 < 100 is true.

If you want to search for books whose price does not exceed a certain limit, an index on the /book/price element can help query performance. Although price values tend to be numeric in nature, you certainly have the choice to index them as DOUBLE or as VARCHAR, as shown in the two rightmost columns of Table 13.2. However, note that value predicates in a query also have a data type that is determined by the type of the literal value. A value in double quotes, such as "29", is always a string while a numeric value without quotes is interpreted as a number.
The Yes/No entries in Table 13.2 show that a string predicate can only be evaluated with an XML index of type VARCHAR while a numeric predicate can only be evaluated with a numeric index (DOUBLE in DB2 for Linux, UNIX, and Windows and DECFLOAT in DB2 for z/OS). In this sense, the data type of the predicate and the data type of the index have to match.

Table 13.2 Index Eligibility for Numeric Versus String Predicates

                         Index definition
Predicate                …USING XMLPATTERN '/book/price'    …USING XMLPATTERN '/book/price'
                         AS SQL DOUBLE                      AS SQL VARCHAR(10)
$i/book[price < "29"]    No                                 Yes
$i/book[price < 29]      Yes                                No

As another example, consider the index of type DATE for the pubdate element (see Figure 13.18). This index can only be used for date predicates. The first of the two SELECT statements in Figure 13.19 cannot use this index, because it performs a simple string comparison. It can only use an index of type VARCHAR. The second SQL/XML statement in Figure 13.19 casts the literal value to xs:date to perform a comparison with date semantics. This predicate can use the DATE index in Figure 13.18.
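The comparison semantics behind Table 13.2 can be checked with a few lines of Python, whose string comparison also proceeds character by character. The sketch only illustrates why the key order of a VARCHAR index differs from the key order of a numeric index; the helper sorted_as is hypothetical and float merely stands in for DOUBLE.

```python
# String comparison works character by character, numeric comparison by magnitude.
print("9" < "29", 9 < 29)                            # False True
print("2" < "100", 2 < 100)                          # False True
print("100" == "1E2", float("100") == float("1E2"))  # False True

def sorted_as(values, cast):
    """Order the values the way an index of the given key type would."""
    return sorted(values, key=cast)

prices = ["9", "29", "100", "1E2"]
string_order = sorted_as(prices, str)
numeric_order = sorted_as(prices, float)

print(string_order)   # ['100', '1E2', '29', '9']
print(numeric_order)  # ['9', '29', '100', '1E2']
```

A range scan for price < 29 over the string order would start and stop at the wrong keys, which is why a numeric predicate needs a numeric index and a string predicate needs a VARCHAR index.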
CREATE INDEX pubDateIndex ON books(bookinfo)
  GENERATE KEYS USING XMLPATTERN '/book/pubdate' AS SQL DATE

Figure 13.18 An index of type DATE

SELECT * FROM books
WHERE XMLEXISTS('$BOOKINFO/book[pubdate = "2009-06-30"]');

SELECT * FROM books
WHERE XMLEXISTS('$BOOKINFO/book[pubdate=xs:date("2009-06-30")]');

Figure 13.19 Querying the pubdate element, with and without using an index of type DATE

13.3.3 Text Nodes in XML Indexes and Query Predicates

We have previously discussed text nodes in sections 3.1, Understanding XML Document Trees, and 6.2, Understanding the XQuery and XPath Data Model. Remember that the value of an XML element is defined as the concatenation of all text nodes in the subtree underneath that element. Query predicates are almost always expressed on leaf elements; that is, elements at the lowest level of the tree with at most one text node. For example, the price element has only a single text node and therefore the predicates [price = 33] and [price/text() = 33] lead to the same result. Hence, you usually do not need to use /text() in predicates. Therefore you should also not use /text() in index definitions. If you do use /text() in a predicate, then you also need to use /text() in the index that you are hoping to use. These rules are summarized in Table 13.3.

Table 13.3 Index Eligibility with Text Nodes

                              Index definition
Predicate                     …USING XMLPATTERN '/book/price'    …USING XMLPATTERN '/book/price/text()'
                              AS SQL DOUBLE                      AS SQL DOUBLE
$i/book[price = 33]           Yes                                No
$i/book[price/text() = 33]    No                                 Yes

As a general guideline we recommend not to use /text() in predicates or index definitions. There are cases when using /text() in predicates can be helpful, such as for non-leaf elements whose immediate children are a mix of element and text nodes (this is called mixed content, see section 3.1). However, this is a relatively rare case.
For example, consider the following XML document: <title>This is a <bold>great</bold> book about XML</title>
For this document, the XPath pattern /title produces a single index entry with the value “This is a great book about XML”, because the value of the title element is the concatenation of all its descendant text nodes. In contrast, the XPath expression /title/text() creates two index entries, one for the text node “This is a ” and one for the text node “ book about XML”. Indexing of non-leaf elements is further discussed in section 13.5.

13.3.4 Wildcards in XML Indexes and Query Predicates

The use of // and * in predicates can affect the containment relationship between an index and a query predicate. For example, the path expressions /book/@id and //@id are different. The path /book/@id identifies all id attributes that belong directly to the element book. In contrast, the path //@id identifies id attributes at any level of the XML documents, including author id attributes. Thus, /book/@id identifies a subset of the attributes specified by //@id. In this sense, //@id contains /book/@id but not the other way around.

Now let’s look at how this affects index eligibility. Consider Figure 13.20 as an example.

SELECT * FROM books
WHERE XMLEXISTS('$i/book[@id = 101]' PASSING bookinfo AS "i")

Figure 13.20 Demonstrating index eligibility

Based on this query, Table 13.4 shows four different ways of writing the XPath predicate in the XMLEXISTS. The two rightmost columns of the table represent two alternative index definitions, and the rows in the table show which of the predicates can (Yes) or cannot (No) be evaluated by either of the indexes.
Table 13.4 Index Eligibility with Wildcards in XML Indexes and Predicates

                             Index definition
     Predicate               …USING XMLPATTERN '/book/@id'    …USING XMLPATTERN '//@id'
                             AS SQL DOUBLE                    AS SQL DOUBLE
1    $i//*[@id = 101]        No                               Yes
2    $i/book[@id = 101]      Yes                              Yes
3    $i/book[@* = 101]       No                               No
4    $i/*[@id = 101]         No                               Yes

For the first predicate, the index on /book/@id is not eligible because it only contains id attributes that belong directly to book. The index does not contain id attributes at any deeper level, such as the author id attributes. However, an author id attribute with a value of 101 would be a valid match for the predicate path $i//*/@id. Thus, if DB2 used the index on /book/@id it might return an incomplete result. The second index on //@id is eligible because it contains all id attributes at any level of the document, as required for the predicate.
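The containment test behind Table 13.4 can be modeled in Python. The sketch below is a simplified model and not the algorithm that the DB2 compiler uses: it handles only linear paths built from /, //, names, * and attribute tests, and all function names are hypothetical. It expands the predicate path into representative concrete paths, using a made-up label wherever a wildcard appears, and checks that the index pattern matches every one of them.

```python
import itertools
import re

FRESH = "__fresh__"  # made-up label, assumed not to occur in any pattern

def parse(pattern):
    """Split '/a//b/@c' into [('/', 'a'), ('//', 'b'), ('/', '@c')]."""
    return re.findall(r"(//|/)([^/]+)", pattern)

def to_regex(pattern):
    """Translate an index pattern into a regex over concrete paths."""
    parts = []
    for axis, test in parse(pattern):
        if axis == "//":
            parts.append(r"(?:/[^/@]+)*")      # any number of element levels
        if test == "*":
            parts.append(r"/[^/@]+")           # any element name
        elif test == "@*":
            parts.append(r"/@[^/]+")           # any attribute name
        else:
            parts.append("/" + re.escape(test))
    return re.compile("".join(parts))

def canonical_paths(pattern, max_gap):
    """Concrete paths for a predicate: wildcards get the fresh label and
    each // is tried with 0..max_gap intermediate element levels."""
    steps = parse(pattern)
    gap_choices = [range(max_gap + 1) if axis == "//" else (0,) for axis, _ in steps]
    for gaps in itertools.product(*gap_choices):
        labels = []
        for (axis, test), gap in zip(steps, gaps):
            labels.extend([FRESH] * gap)
            labels.append({"*": FRESH, "@*": "@" + FRESH}.get(test, test))
        yield "/" + "/".join(labels)

def index_contains(index_pattern, predicate_path):
    """True if every representative path of the predicate is in the index."""
    rx = to_regex(index_pattern)
    bound = len(parse(index_pattern)) + 1
    return all(rx.fullmatch(p) for p in canonical_paths(predicate_path, bound))

# Predicate paths 1 to 4 from Table 13.4, written as plain paths.
predicates = ["//*/@id", "/book/@id", "/book/@*", "/*/@id"]
lean = [index_contains("/book/@id", q) for q in predicates]
heavy = [index_contains("//@id", q) for q in predicates]
print(lean)   # [False, True, False, False]
print(heavy)  # [True, True, False, True]
```

The two printed rows correspond to the two index columns of Table 13.4, read from predicate 1 to predicate 4.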
The second predicate specifies a full path to the id attribute, so both indexes are eligible. The first index on /book/@id contains exactly what the predicate path is looking for; that is, book id attributes. The second index on //@id contains even more, and is therefore also eligible.

The third predicate uses @* as a wildcard, such that it looks for any attribute of book with a value of 101. Not only id attributes can fulfill this predicate, but for example, a document with value 101 in the attribute /book/@isbn is also a valid match. But, isbn attributes are not included in either of the two indexes. Therefore, neither index is used because DB2 cannot risk returning incomplete results for this predicate.

The fourth predicate $i/*[@id = 101] looks for id attributes under any root element, not just book. If there is a document with a path /journal/@id, then it might satisfy the predicate, but it is not included in the index on /book/@id. Therefore, this index cannot be used because DB2 would again risk returning an incomplete query result. However, the index on //@id contains any id attribute, irrespective of the root element, so this index can be used.

In a nutshell, the DB2 query compiler always needs to be able to prove that the index is equally or less restrictive than the predicate, so that it contains everything that the predicate is looking for.

Be aware that using wildcards in index definitions might inadvertently index more nodes than needed. Wherever possible, it is recommended to use the exact path to the desired elements or attributes in index definitions and queries, without wildcards. Very generic XML index patterns such as //* or //text() are possible, but should be used with great caution.

NOTE An index on //* indexes all elements, including non-leaf elements, which is typically not useful.
An index on //* contains an index entry even for the root element, and the key value is the concatenation of all text nodes in the document. Such a key value can easily exceed the length constraint of a VARCHAR(n) index so that the document cannot be inserted.

13.3.5 Using Indexes for Structural Predicates

As we discussed in sections 6.8, Existential Semantics, and 7.4, Using XPath Predicates in SQL/XML with XMLEXISTS, a structural predicate is one that checks for the existence of an element or attribute irrespective of its value. As an example, let’s find the titles of those documents in the books table that have an explicit publication date; that is, a pubdate element that exists under the book element. The corresponding query is shown in Figure 13.21. The XMLEXISTS predicate evaluates to true if a book element has at least one pubdate child element (existential semantics).
    SELECT XMLQUERY('$b/book/title' PASSING bookinfo AS "b")
    FROM books
    WHERE XMLEXISTS('$b/book/pubdate' PASSING bookinfo AS "b")

Figure 13.21 Query with a structural predicate

To support the structural predicate in this query and avoid a table scan, you might want to create a corresponding index of type VARCHAR, as shown in Figure 13.22.

    CREATE INDEX pubdate_index ON books(bookinfo)
      GENERATE KEY USING XMLPATTERN '/book/pubdate' AS SQL VARCHAR(12)

Figure 13.22 Index for the pubdate element

DB2 9 for z/OS is able to use this index to evaluate the structural predicate in Figure 13.21. DB2 for Linux, UNIX, and Windows can also use the index to evaluate the predicate, provided that you rewrite the structural predicate into a suitable value predicate. Figure 13.23 shows a slightly rewritten query that returns all books that have a pubdate element with a value greater than the empty string.

    -- Query:
    SELECT XMLQUERY('$b/book/title' PASSING bookinfo AS "b")
    FROM books
    WHERE XMLEXISTS('$b/book[pubdate > ""]' PASSING bookinfo AS "b")

    -- Access Plan:
                  RETURN
                  (   1)
                    |
                  NLJOIN
                  (   2)
                 /--+--\
             FETCH      XSCAN
             (   3)     (   7)
            /---+---\
        RIDSCN    TABLE: DB2ADMIN
        (   4)        BOOKS
          |
        SORT
        (   5)
          |
        XISCAN
        (   6)
          |
        XMLIN: DB2ADMIN
        PUBDATE_INDEX

Figure 13.23 Query with a value predicate to mimic a structural predicate
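How the value predicate in Figure 13.23 differs from the structural predicate in Figure 13.21 can be sketched outside of DB2. The following Python fragment uses the standard library's ElementTree and three made-up book documents. Plain Python string comparison stands in for the XQuery comparison, an assumption that is reasonable only because the comparison is against the empty string.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: with a pubdate, with an empty pubdate, without one.
docs = {
    "dated":   "<book><title>A</title><pubdate>2009-08-01</pubdate></book>",
    "empty":   "<book><title>B</title><pubdate></pubdate></book>",
    "undated": "<book><title>C</title></book>",
}

def pubdate_values(xml_text):
    """Return the string value of every /book/pubdate element."""
    book = ET.fromstring(xml_text)
    return [(p.text or "") for p in book.findall("pubdate")]

# Existential semantics: a document qualifies if any pubdate passes the test.
structural = [k for k, d in docs.items() if len(pubdate_values(d)) > 0]
greater = [k for k, d in docs.items()
           if any(v > "" for v in pubdate_values(d))]
greater_eq = [k for k, d in docs.items()
              if any(v >= "" for v in pubdate_values(d))]

print(structural)   # ['dated', 'empty']
print(greater)      # ['dated']
print(greater_eq)   # ['dated', 'empty']
```

Only the comparison with >= "" selects the same documents as the pure existence test.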
The query uses a value predicate that is eligible to use the index in Figure 13.22. If a document contains an empty pubdate element (<pubdate></pubdate>), the query does not return the title of that document. If you want the predicate to also match empty pubdate elements, change it to [pubdate >= ""].

The access plan in Figure 13.23 confirms that an XML index scan (XISCAN) is used to probe into the PUBDATE_INDEX. This XISCAN produces the row IDs of those documents that match the predicate. The FETCH operator then retrieves only the rows for those matching documents from the table. After that, the XSCAN operator extracts the title elements from the qualifying documents. Note that the nested loop join (NLJOIN) right above the XSCAN is not a typical join operation. It merely facilitates the passing of document pointers to the XSCAN operator. For more background on access plans, see Chapter 14, XML Performance and Monitoring.

13.4 XML INDEXES AND JOIN PREDICATES

In section 9.2, Join Queries with XML Data, we explained how to write joins between XML columns. Index usage for join predicates requires special considerations. Let’s look at joining two tables with XML columns. The first one is the books table that we have been using so far, and the second one is a table called authors, containing author information. The authors table is created as follows and contains the two documents in Figure 13.24.

    CREATE TABLE authors (authorinfo XML)

    <author id="2001">
      <name>Mary Smith</name>
      <addr country="USA">
        <street>555 Bailey Avenue</street>
        <city>San Jose</city>
      </addr>
      <phone>
        <areacode>408</areacode>
        <number>4511234</number>
      </phone>
    </author>

    <author id="2002">
      <name>Tom Noodle</name>
      <addr country="Canada">
        <street>213 Rigatoni Road</street>
        <city>Toronto</city>
      </addr>
      <phone>
        <areacode>905</areacode>
        <number>8110583</number>
      </phone>
    </author>

Figure 13.24 Sample documents in the authors table
Take a moment to compare the author documents to the book documents in Figure 13.1. You’ll notice that Figure 13.24 contains information about two of the three authors who are referenced in the book documents via their id attributes and names. Therefore, you can join the tables authors and books on the author id attribute. Figure 13.25 shows such a join first in XQuery notation and then in the equivalent SQL/XML syntax that also runs on DB2 for z/OS. Both queries return the same result. For each author listed in the authors table, these queries retrieve the isbn numbers of the author’s publications from the books table and return a small document with the isbn number, author id, and name. Although only two authors are listed in the authors table, the queries in Figure 13.25 return three rows because Mary Smith has published two books.

    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author
    for $b in db2-fn:xmlcolumn("BOOKS.BOOKINFO")/book
    where $a/@id = $b/authors/author/@id
    return <pub>{$b/@isbn}{$a/@id}{$a/name/text()}</pub>;

    SELECT XMLELEMENT(name "pub",
             XMLQUERY('$b/book/@isbn' PASSING bookinfo as "b"),
             XMLQUERY('$a/author/@id' PASSING authorinfo as "a"),
             XMLQUERY('$a/author/name' PASSING authorinfo as "a") )
    FROM books, authors
    WHERE XMLEXISTS('$a/author[@id = $b/book/authors/author/@id ]'
                    PASSING bookinfo as "b", authorinfo as "a");

    <pub isbn="0-321-18060-7" id="2001">Mary Smith</pub>
    <pub isbn="0-596-00252-1" id="2001">Mary Smith</pub>
    <pub isbn="1-59059-983-7" id="2002">Tom Noodle</pub>

      3 record(s) selected.

Figure 13.25 A join between books and authors in XQuery and SQL/XML notation

The join predicates of the queries appear in their where clauses. To improve the performance of these queries, it seems useful to define the indexes in Figure 13.26. Use DECFLOAT instead of DOUBLE on DB2 for z/OS.
    CREATE INDEX bookAuthorIdx ON books(bookinfo)
      GENERATE KEY USING XMLPATTERN '/book/authors/author/@id' AS SQL DOUBLE

    CREATE INDEX authorIdx ON authors(authorinfo)
      GENERATE KEY USING XMLPATTERN '/author/@id' AS SQL DOUBLE

Figure 13.26 Indexes on author id attributes in the books and authors tables
However, the queries in Figure 13.25 cannot use either of these indexes to evaluate their join predicates. The reason is that a join predicate does not contain a literal value that indicates the data type of the comparison. Therefore, DB2 needs to look for matching author ids of any data type, not just numeric values. For example, it is possible that an author has a non-numeric id value, such as TN28, in both the authors and the books table. This value would be a valid join match. However, the numeric indexes bookAuthorIdx and authorIdx do not contain alphanumeric values such as TN28. If DB2 used one of these indexes to evaluate the join predicate it would not find author id TN28 and return an incomplete join result. Thus, DB2 cannot use those indexes and resorts to a table scan to ensure a correct query result.

Note that changing the index data types to VARCHAR does not help but only reverses the problem. VARCHAR indexes on the author ids allow DB2 to find alphanumeric join matches but might cause DB2 to miss numeric join matches. For example, the values 2001 and 2.001E3 are identical numeric values and should be recognized as a join match. However, the strings “2001” and “2.001E3” are different and not identical in a VARCHAR index. Again, DB2 has no choice but to perform a table scan to guarantee that it catches all possible join matches.

In many situations, you probably know that all the values that you join on are of a certain data type. For example, in our book and author example, all the author ids are numeric, so it is perfectly safe to use the numeric indexes in Figure 13.26. In this case, you need to tell DB2 that you want the join to be restricted to numeric comparisons. The way to do this is shown in Figure 13.27, where both sides of the join predicate use a cast to xs:double. The cast explicitly excludes non-numeric matches from the join and allows DB2 to use the DOUBLE indexes in Figure 13.26.
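The difference between numeric and string join semantics is easy to reproduce in a few lines of Python. This is a sketch of the comparison semantics only, not of DB2 itself. The helper names are invented for illustration, and returning None for a value that cannot be cast is a simplification of this sketch.

```python
def as_double(value):
    """Mimic a cast to xs:double; return None when the cast is not possible."""
    try:
        return float(value)
    except ValueError:
        return None

def numeric_match(a, b):
    """Join match under numeric comparison (what a DOUBLE index can find)."""
    x, y = as_double(a), as_double(b)
    return x is not None and y is not None and x == y

def string_match(a, b):
    """Join match under string comparison (what a VARCHAR index can find)."""
    return a == b

pairs = [("2001", "2001"), ("2001", "2.001E3"), ("TN28", "TN28")]

for a, b in pairs:
    print(a, b, numeric_match(a, b), string_match(a, b))
# 2001 2001 True True
# 2001 2.001E3 True False
# TN28 TN28 False True
```

Neither comparison finds all three pairs, which is why an untyped join predicate cannot be answered from an index of a single data type.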
    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author
    for $b in db2-fn:xmlcolumn("BOOKS.BOOKINFO")/book
    where $a/@id/xs:double(.) = $b/authors/author/@id/xs:double(.)
    return <pub>{$b/@isbn}{$a/@id}{$a/name/text()}</pub>;

Figure 13.27 XQuery with a join predicate and proper casting

You can write the same join in SQL/XML notation in two ways, as shown in Figure 13.28. They differ in the “direction” of the join predicate in the XMLEXISTS.

    SELECT XMLELEMENT(name "pub",
             XMLQUERY('$b/book/@isbn' PASSING bookinfo as "b"),
             XMLQUERY('$a/author/@id' PASSING authorinfo as "a"),
             XMLQUERY('$a/author/name' PASSING authorinfo as "a") )
    FROM books, authors
    WHERE XMLEXISTS('$b/book/authors[author/@id/xs:double(.) =
                                     $a/author/@id/xs:double(.) ]'
                    PASSING bookinfo as "b", authorinfo as "a");

Figure 13.28 SQL/XML query using join predicates
    SELECT XMLELEMENT(name "pub",
             XMLQUERY('$b/book/@isbn' PASSING bookinfo as "b"),
             XMLQUERY('$a/author/@id' PASSING authorinfo as "a"),
             XMLQUERY('$a/author/name' PASSING authorinfo as "a") )
    FROM books, authors
    WHERE XMLEXISTS('$a/author[@id/xs:double(.) =
                               $b/book/authors/author/@id/xs:double(.) ]'
                    PASSING bookinfo as "b", authorinfo as "a");

Figure 13.28 SQL/XML query using join predicates (Continued)

In DB2 9 for z/OS as well as DB2 9.1 and 9.5 for Linux, UNIX, and Windows, the queries in Figure 13.28 enforce different join orders, so they typically do not perform the same. It’s the join predicate that determines the join order and the performance.

In the first query, the predicate in square brackets is applied to the expression starting with $b (column bookinfo). This allows DB2 to use an index to access the books table. As a result, the query performs a table scan on the authors table and then uses the index bookAuthorIdx to probe for matches into the books table.

In the second query, the join condition is a predicate on the expression starting with $a (column authorinfo), so DB2 can use an index to access the authors table. It performs a table scan on the books table and then uses the index authorIdx to probe into the authors table.

You typically want the table scan to be performed on the smaller table. Hence, the choice between these two statements depends on the size of the books and authors tables. If the books table has more rows than the authors table, then the first statement in Figure 13.28 is preferable.

DB2 9 for z/OS requires APAR PK55783 to use indexes for the join predicates in Figure 13.28. Also see APAR II14426 for the latest status.

NOTE In DB2 9.7 for Linux, UNIX, and Windows, the join order is no longer determined by how the join predicate inside XMLEXISTS is written. The DB2 compiler makes a cost-based decision to choose the appropriate join order.
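Why the scan should go to the smaller table can be illustrated with a rough Python sketch of "scan one side, probe an index on the other". It is an illustration under simplifying assumptions, not DB2's join implementation: the function name is invented, a dictionary stands in for the XML index, and the rows are reduced to (author id, label) pairs taken from the result in Figure 13.25.

```python
from collections import defaultdict

def scan_and_probe(outer_rows, inner_rows):
    """Scan outer_rows once and probe an index built on inner_rows.

    Rows are (author_id, label) tuples. Returns the joined pairs and the
    number of rows read by the scan of the outer side.
    """
    # The dictionary stands in for an existing XML index on the inner table.
    inner_index = defaultdict(list)
    for key, label in inner_rows:
        inner_index[key].append(label)

    result, scanned = [], 0
    for key, label in outer_rows:                  # the table scan
        scanned += 1
        for match in inner_index.get(key, []):     # the index probe
            result.append((label, match))
    return result, scanned

authors = [(2001.0, "Mary Smith"), (2002.0, "Tom Noodle")]
books = [(2001.0, "0-321-18060-7"), (2001.0, "0-596-00252-1"),
         (2002.0, "1-59059-983-7")]

pairs_1, scanned_1 = scan_and_probe(authors, books)   # like the first query
pairs_2, scanned_2 = scan_and_probe(books, authors)   # like the second query

print(len(pairs_1), scanned_1)   # 3 2
print(len(pairs_2), scanned_2)   # 3 3
```

Both directions produce the same three matches, but the first one reads fewer rows in the scan because authors is the smaller table in this example.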
To summarize the advice for XML join queries, you should always cast join predicates to the type of the XML index that should be used. Otherwise query semantics do not allow index usage. If the XML index is defined as DOUBLE or DECFLOAT, cast the join predicate with xs:double. If the XML index is defined as VARCHAR, cast the join predicate with fn:string, and so on as shown in Table 13.5.
Table 13.5 Summary of Casting Rules for XML Join Predicates

Index Type                    Cast Join Predicate Using   Comment
DOUBLE, DECFLOAT              xs:double                   For any numeric predicate
VARCHAR(n), VARCHAR HASHED    fn:string                   For any string predicate
DATE                          xs:date                     For any date predicate
TIMESTAMP                     xs:dateTime                 For any timestamp predicate

13.5 XML INDEXES ON NON-LEAF ELEMENTS

Non-leaf elements are elements that contain other elements. They are not at the bottom of a document tree. In contrast, a leaf element is at the lowest level of the document tree and contains at most a text node (see section 3.1, Understanding XML Document Trees). In our author documents in this chapter, elements such as addr and phone are non-leaf elements because they contain other elements. The addr element contains the elements street and city, and the phone element contains the elements areacode and number. Let’s remind ourselves of the document structure:

    <author id="2001">
      <name>Mary Smith</name>
      <addr country="USA">
        <street>555 Bailey Avenue</street>
        <city>San Jose</city>
      </addr>
      <phone>
        <areacode>408</areacode>
        <number>4511234</number>
      </phone>
    </author>

In the majority of cases, indexes on non-leaf elements are not useful. For example, it does not make sense to create an index on the non-leaf element /author/addr. This index has one index entry for the document above because there is one occurrence of the addr element. The XML data model defines the value of a non-leaf element as the concatenation of all text nodes (but not attributes) in the subtree under that element. Therefore, the index entry has the key value “555 Bailey AvenueSan Jose”. Note that there is no space between Avenue and San Jose. Since you normally do not query your data with such concatenated values, the index is typically not helpful. If you need index support for predicates on the street and the city elements of the address, it is better to define two separate indexes on these two leaf elements.
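The concatenated value of a non-leaf element can be reproduced with Python's standard ElementTree module. This is a minimal sketch: the helper name `string_value` is ours, and the document is written without whitespace between tags so that only element text contributes to the result.

```python
import xml.etree.ElementTree as ET

# The author document from this section, without whitespace between tags.
doc = ('<author id="2001"><name>Mary Smith</name>'
       '<addr country="USA"><street>555 Bailey Avenue</street>'
       '<city>San Jose</city></addr>'
       '<phone><areacode>408</areacode><number>4511234</number></phone>'
       '</author>')

author = ET.fromstring(doc)

def string_value(element):
    """Concatenate all descendant text nodes; attributes are ignored."""
    return "".join(element.itertext())

print(string_value(author.find("addr")))   # 555 Bailey AvenueSan Jose
print(string_value(author))                # Mary Smith555 Bailey AvenueSan Jose4084511234
```

The second value shows how quickly the key for an element near the root grows, since it absorbs every text node beneath it.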
Now let’s look at a case where an index on a non-leaf element can make sense. For example, assume that queries search authors sometimes by area code and sometimes by their full phone number. In this case, you can define one XML index on the non-leaf element phone (Figure 13.29) and one on the element areacode (Figure 13.30).

    CREATE INDEX phoneidx ON authors(authorinfo)
      GENERATE KEY USING XMLPATTERN '/author/phone' AS SQL DOUBLE

Figure 13.29 Index on a non-leaf element

For our preceding sample document, the value of the phone element is the concatenation of the text nodes of the areacode and number elements: 4084511234. This concatenation is meaningful because the areacode and number elements do not have any further siblings that would contribute to and obscure the concatenated value.

    CREATE INDEX areaidx ON authors(authorinfo)
      GENERATE KEY USING XMLPATTERN '/author/phone/areacode' AS SQL DOUBLE

Figure 13.30 Index on a leaf element

Figure 13.31 contains a predicate on the non-leaf element phone and can use the non-leaf index on /author/phone.

    SELECT authorinfo
    FROM authors
    WHERE XMLEXISTS('$AUTHORINFO/author[phone=4084511234]')

Figure 13.31 Query that uses an index on a non-leaf element

Figure 13.32 only constrains the areacode element and can use the index in Figure 13.30.

    SELECT authorinfo
    FROM authors
    WHERE XMLEXISTS('$AUTHORINFO/author/phone[areacode=408]')

Figure 13.32 Query that uses an index on a leaf element
13.6 SPECIAL CASES WHERE XML INDEXES CANNOT BE USED

This section discusses specific situations where XML indexes are not eligible for certain predicates.

13.6.1 Special Cases with XMLQUERY

All the guidelines for XML index eligibility discussed in the previous sections apply to both XQuery and SQL/XML queries. Additionally, there are some specific considerations for the SQL/XML functions XMLQUERY and XMLEXISTS.

If you use XML predicates in the XMLQUERY function in the SELECT clause of an SQL statement, then these predicates do not eliminate any rows from the result set and therefore cannot use an index. Such predicates only apply to one document at a time and might return a (possibly empty) fragment of a document. Thus, you should place any document and row-filtering predicates into an XMLEXISTS predicate in the WHERE clause of the SQL/XML statement. Figure 13.33 provides an example.

    -- This query cannot use an index:
    SELECT XMLQUERY('$BOOKINFO/book[@id = 101]/title')
    FROM books;

    -- This query can use an index:
    SELECT XMLQUERY('$BOOKINFO/book/title')
    FROM books
    WHERE XMLEXISTS('$BOOKINFO/book[@id = 101]');

Figure 13.33 Index usage with XMLEXISTS versus XMLQUERY

13.6.2 Parent Steps

DB2 cannot use an index for predicates that occur under a parent step (“..”), such as the predicates on price in the two queries in Figure 13.34.

    -- Query 1
    SELECT bookinfo
    FROM books
    WHERE XMLEXISTS('$BOOKINFO/book/title[../price < 10]');

    -- Query 2
    xquery
    for $b in db2-fn:xmlcolumn("BOOKS.BOOKINFO")/book/title
    where $b/../price < 10
    return $b ;

Figure 13.34 Queries with parent steps in the predicate don’t use indexes
This is not a significant limitation because you can always express these predicates without the parent axis, as shown in Figure 13.35.

    -- Query 3:
    SELECT bookinfo
    FROM books
    WHERE XMLEXISTS('$BOOKINFO/book[price < 10]/title');

    -- Query 4:
    xquery
    for $b in db2-fn:xmlcolumn("BOOKS.BOOKINFO")/book
    where $b/price < 10
    return $b/title ;

Figure 13.35 Queries without parent steps in the predicate can use indexes

13.6.3 The let and return Clauses

Be aware that predicates in XQuery let and return clauses do not filter result sets, and therefore they do not use indexes. The next two queries (Figure 13.36 and Figure 13.37) cannot use an index because an element phone408 needs to be returned for every author, even if it is an empty element for authors outside the 408 area code.

    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author
    let $p := $a/phone[areacode=408]//text()
    return <phone408>{$p}</phone408>

Figure 13.36 No index usage for predicates with let and element construction

The second example of a query that doesn’t use an index is Figure 13.37.

    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author
    return <phone408>{$a/phone[areacode=408]//text()}</phone408>

Figure 13.37 No index usage for predicates with return and element construction

If you want the queries in Figure 13.36 and Figure 13.37 to use an XML index, you need to move the predicate from the let or return clause into the where or for clause (see Figure 13.38). Both queries in Figure 13.38 return the same result, which is different from the results produced by the queries in Figure 13.36 and Figure 13.37. The difference is that the queries in Figure 13.38 do not produce empty phone408 elements for authors whose area code is not 408. Instead, such authors are not represented at all in the result.
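The difference in results can be mimicked with two Python list comprehensions over the author data from Figure 13.24. This is an analogy for the query semantics only: the tuples and the helper `phone_text` are assumptions of the sketch, not DB2 constructs.

```python
# Author data from Figure 13.24 as (name, areacode, number) tuples.
authors = [
    ("Mary Smith", 408, "4511234"),
    ("Tom Noodle", 905, "8110583"),
]

def phone_text(areacode, number):
    """Stand-in for the text nodes under a phone element."""
    return f"{areacode}{number}"

# Predicate in the let or return clause (Figures 13.36 and 13.37):
# one element per author, empty when the predicate does not match.
in_return = [
    "<phone408>" + (phone_text(ac, num) if ac == 408 else "") + "</phone408>"
    for _, ac, num in authors
]

# Predicate in the where or for clause:
# non-matching authors are dropped before any element is constructed.
in_where = [
    "<phone408>" + phone_text(ac, num) + "</phone408>"
    for _, ac, num in authors
    if ac == 408
]

print(in_return)  # ['<phone408>4084511234</phone408>', '<phone408></phone408>']
print(in_where)   # ['<phone408>4084511234</phone408>']
```

Only the second form discards rows, and only a predicate that discards rows gives an index something to do.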
    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author
    where $a/phone[areacode="408"]
    return <phone408>{$a/phone[areacode="408"]//text()}</phone408>;

    xquery
    for $a in db2-fn:xmlcolumn("AUTHORS.AUTHORINFO")/author/phone[areacode="408"]
    return <phone408>{$a//text()}</phone408>;

Figure 13.38 Queries that can use an index

13.7 XML INDEX INTERNALS

In this section we provide a glimpse at how XML indexes are implemented in DB2 for Linux, UNIX, and Windows.

13.7.1 XML Index Keys

In general terms, an XML index is a mapping from path/value pairs to document ID (docID), node ID (nodeID), and row ID (RID). For example, for the first document in Figure 13.1, the index on //@id in Figure 13.39 contains the path/value pairs shown in Figure 13.40.

    CREATE INDEX idx3 ON books(bookinfo)
      GENERATE KEYS USING XMLPATTERN '//@id' AS SQL VARCHAR(10)

Figure 13.39 Indexing nodes at multiple paths

    (/book/@id, "101")
    (/book/authors/author/@id, "2000")
    (/book/authors/author/@id, "2001")

Figure 13.40 Path/Value pairs represented in the XML Index on //@id

The index maps each of these pairs to a docID, a nodeID, and a RID. The docID identifies the document that contains the matching node. The nodeID identifies the matching node and region within the document. The RID identifies the row that contains the matching document, similar to RIDs in regular relational indexes.

To save space, the XML index contains pathIDs instead of the actual paths. The pathIDs are integers that uniquely identify the actual paths in the XML data. To maintain the mapping from paths to pathIDs, DB2 automatically creates one path index for each XML column. The path
index contains one entry for each distinct path that occurs in the XML column. The index maps each distinct path to a unique pathID. Path indexes tend to be very small since they have only one entry per unique path, even if the table is very large. In the catalog table syscat.indexes, path indexes have the index type XPTH. Table 13.6 shows the logical mapping from paths to pathIDs for the sample data in Figure 13.1.

Table 13.6 Logical Mapping from Paths to PathIDs

Path                          PathID
/book                         100
/book/@id                     101
/book/authors                 102
/book/authors/author          103
/book/authors/author/@id      104
/book/price                   105
/book/pubdate                 106
/book/title                   107
/book/title/@isbn             108
…                             …

In XPath, the use of the // is fairly common and qualifies the end of the path rather than the beginning. For example, the path //@id identifies all paths that end in “@id”. To find these paths more efficiently, the path index actually stores each path in reverse, from leaf to root, as indicated in Table 13.7. The reverse paths allow DB2 to perform a simple prefix lookup to find all paths that end in “@id”.

Table 13.7 Actual Mapping from Paths to pathIDs, Using Reversed Paths

Reversed Path                 PathID
author/authors/book/          100
authors/book/                 101
book/                         102
@id/author/authors/book/      103
@id/book/                     104
@isbn/title/book/             105
price/book/                   106
pubdate/book/                 107
title/book/                   108
…                             …

Based on this mapping, the logical path/value pairs in Figure 13.40 are actually represented as the pathID/value pairs shown in Figure 13.41. The use of pathIDs makes user-defined XML indexes smaller than they would be otherwise. Essentially, the path index acts as a compression dictionary for each user-defined XML index.

    (104, "101")
    (103, "2000")
    (103, "2001")

Figure 13.41 pathID/value pairs in the XML index on //@id

How are these pathID/value pairs resolved during query processing? Let’s assume a query contains the XPath predicate /book/authors/author[@id="2000"]. DB2 reverses the path to @id/author/authors/book/, performs a lookup in the internal path index, and finds pathID 103. Then DB2 probes the user-defined XML index with the pathID/value pair (103, "2000") to find the matching documents.

If a query contains the predicate //@id[. ="2000"], DB2 performs a prefix lookup on the path index and finds pathIDs 103 and 104. Based on that, DB2 probes the user-defined XML index with the pairs (103, "2000") and (104, "2000") but only one of them results in a match. This is because there is an author id with the value 2000, but not a book id with the value 2000.

13.7.2 Logical and Physical XML Indexes

When you define an XML index in DB2 for Linux, UNIX, and Windows, DB2 creates two indexes internally: a logical index and a physical index. The logical index carries the index name that you provide in the CREATE INDEX statement and it contains the meta information about the index, such as the XMLPATTERN. The logical index occupies an insignificant amount of space. The physical index has a system-generated name and contains the actual B-tree structure that holds the index keys.
The relationship between logical and physical indexes is kept in the catalog view SYSCAT.INDEXXMLPATTERNS (see Chapter 22, Exploring XML Information in the DB2 Catalog).
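Returning briefly to the path index of section 13.7.1, the reversed-path lookup can be sketched in Python. This is a simplified model and not DB2's internal data structure: a sorted list with binary search stands in for the path index, a set stands in for the user-defined index, and all function names are invented. The reversed paths and pathIDs are those of Table 13.7, and the pairs are those of Figure 13.41.

```python
import bisect

# Reversed paths and pathIDs as listed in Table 13.7, kept in sorted order.
path_index = sorted([
    ("author/authors/book/", 100),
    ("authors/book/", 101),
    ("book/", 102),
    ("@id/author/authors/book/", 103),
    ("@id/book/", 104),
    ("@isbn/title/book/", 105),
    ("price/book/", 106),
    ("pubdate/book/", 107),
    ("title/book/", 108),
])

# pathID/value pairs of the user-defined index on //@id (Figure 13.41).
xml_index = {(104, "101"), (103, "2000"), (103, "2001")}

def reverse_path(path):
    """'/book/authors/author/@id' -> '@id/author/authors/book/'"""
    steps = [s for s in path.split("/") if s]
    return "/".join(reversed(steps)) + "/"

def lookup_exact(path):
    """PathIDs for a fully specified path."""
    target = reverse_path(path)
    return [pid for rev, pid in path_index if rev == target]

def lookup_suffix(last_steps):
    """PathIDs of all paths ending in last_steps, e.g. '@id' for //@id."""
    prefix = reverse_path(last_steps)
    start = bisect.bisect_left(path_index, (prefix,))
    ids = []
    for rev, pid in path_index[start:]:
        if not rev.startswith(prefix):
            break                      # sorted order: no further matches
        ids.append(pid)
    return ids

def probe(path_ids, value):
    """Probe the user-defined index with each pathID/value pair."""
    return [(pid, value) for pid in path_ids if (pid, value) in xml_index]

print(lookup_exact("/book/authors/author/@id"))   # [103]
print(lookup_suffix("@id"))                       # [103, 104]
print(probe(lookup_suffix("@id"), "2000"))        # [(103, '2000')]
```

Because the reversed paths are sorted, all paths that end in @id sit next to each other, so a single positioned scan finds them.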
When you collect statistics for tables and indexes with the RUNSTATS command, note that index statistics are associated with the physical XML index, not the logical index. Since a physical XML index is just a B-tree, the same statistics apply as for a relational B-tree index. For example, you can examine the key cardinalities of an XML index with the query in Figure 13.42. This query joins the catalog view SYSCAT.INDEXES, which contains the key cardinalities, with SYSCAT.INDEXXMLPATTERNS, which maps logical to physical index names. This join allows you to easily examine the key cardinalities based on a logical index name.

    SELECT x.indname, pattern,
           firstkeycard  AS f1kc,
           first2keycard AS f2kc,
           first3keycard AS f3kc,
           first4keycard AS f4kc,
           fullkeycard   AS fkc
    FROM syscat.indexes i, syscat.indexxmlpatterns x
    WHERE i.indname = x.pindname
      AND x.indname = 'IDX3';

    INDNAME     PATTERN     F1KC   F2KC   F3KC   F4KC
    ----------- ----------- ------ ------ ------ ------
    IDX3        //@id            2      6      7      7

      1 record(s) selected.

Figure 13.42 Key cardinalities of XML index

To interpret the key cardinalities, remember that the first four parts of an XML index entry are pathID, value, docID, and nodeID. Hence, the column firstkeycard in the catalog view syscat.indexes contains the number of distinct pathIDs in the index. This number is the cardinality of the first key of the index. For our sample index on //@id and the sample data in Figure 13.1, the firstkeycard value is 2 because the index contains index entries for two paths, /book/@id and /book/authors/author/@id. The column first2keycard indicates the number of unique pathID/value pairs in the index. This number is 6, because our sample data contains six distinct book ids and author ids. The column first3keycard shows the number of distinct pathID/value/docID triplets. In our example, this number is 7 because one pathID/value pair occurs in two different documents.
The author with the id attribute 2001 appears in two of the book documents. The first4keycard and fullkeycard are also 7.

13.8 XML INDEX STATISTICS

In this section we look at a more comprehensive example of XML index statistics in DB2 for Linux, UNIX, and Windows. This example is based on the customer table of the DB2 sample database. This table contains six rows with relational cid values from 1000 through 1005, and six corresponding XML documents in the XML column info. Figure 13.43 shows one of these documents. The others can be found in Appendix B, The XML Sample Database.
    <customerinfo Cid="1004">
      <name>Matt Foreman</name>
      <addr country="Canada">
        <street>1596 Baseline</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M3Z 5H9</pcode-zip>
      </addr>
      <phone type="work">905-555-4789</phone>
      <phone type="home">416-555-3376</phone>
      <assistant>
        <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>
      </assistant>
    </customerinfo>

Figure 13.43 A sample document

Let’s create two indexes on the info column, as shown in Figure 13.44. The first index is defined on /customerinfo/phone to cover the customer phones only. The second index on //phone contains index entries for all phone elements, including assistant phones. The clause COLLECT DETAILED STATISTICS specifies that index statistics are to be collected during the creation of the index. Alternatively, you can run the RUNSTATS command after the creation of the indexes.

    CREATE INDEX custPhoneIdx ON customer(info)
      GENERATE KEY USING XMLPATTERN '/customerinfo/phone' AS SQL VARCHAR(50)
      COLLECT DETAILED STATISTICS

    CREATE INDEX allPhoneIdx ON customer(info)
      GENERATE KEY USING XMLPATTERN '//phone' AS SQL VARCHAR(50)
      COLLECT DETAILED STATISTICS

Figure 13.44 Creating the test indexes

After creating these two indexes, let’s look at the index statistics. They are visible in the two catalog views, SYSCAT.INDEXES and SYSSTAT.INDEXES. Figure 13.45 retrieves the index key cardinalities together with index names and patterns from SYSSTAT.INDEXES.

    SELECT SUBSTR(x.indname,1,12) AS log_index,
           SUBSTR(x.pattern,1,20) AS pattern,
           i.firstkeycard  AS f1kc,
           i.first2keycard AS f2kc,
           i.first3keycard AS f3kc,
           i.first4keycard AS f4kc,
           i.fullkeycard   AS fkc
    FROM sysstat.indexes i, syscat.indexxmlpatterns x
    WHERE i.indname = x.pindname;

    LOG_INDEX    PATTERN              F1KC  F2KC  F3KC  F4KC  FKC
    ------------ -------------------- ----- ----- ----- ----- -----
    CUSTPHONEIDX /customerinfo/phone      1     9    11    11    11
    ALLPHONEIDX  //phone                  2    11    13    13    13

      2 record(s) selected.
Figure 13.45 Examining XML index statistics in the catalog view SYSSTAT.INDEXES
The columns firstkeycard, first2keycard, and so on are interpreted as follows:

• firstkeycard: number of distinct pathIDs
• first2keycard: number of distinct pathID/value pairs
• first3keycard: number of distinct pathID/value/docID triplets
• first4keycard: number of distinct pathID/value/docID/nodeID tuples
• fullkeycard: number of distinct index entries

Since the index custPhoneIdx covers the single XPath /customerinfo/phone, the firstkeycard value is always 1 because the pathID is the same for all index entries. The index allPhoneIdx on //phone has a firstkeycard value of 2, because it includes index entries for phone elements on two different paths, /customerinfo/phone as well as /customerinfo/assistant/phone.

To better understand the values first2keycard, first3keycard, and so on, look at the 13 phone elements that occur in the customer table and their distribution across documents (Table 13.8). Note that the phone numbers #10 and #13 are assistant phone numbers and only occur in the index allPhoneIdx. Hence, first4keycard and fullkeycard are 13 for the index allPhoneIdx but 11 for the index custPhoneIdx.

Table 13.8 Summary of phone elements

cid Value   #    All phone Elements
1000        1    <phone type="work">416-555-1358</phone>
1001        2    <phone type="work">905-555-7258</phone>
1002        3    <phone type="work">905-555-7258</phone>
1003        4    <phone type="work">905-555-7258</phone>
            5    <phone type="home">416-555-2937</phone>
            6    <phone type="cell">905-555-8743</phone>
            7    <phone type="cottage">613-555-3278</phone>
1004        8    <phone type="work">905-555-4789</phone>
            9    <phone type="home">416-555-3376</phone>
            10   <phone type="home">416-555-3426</phone>
1005        11   <phone type="work">905-555-9146</phone>
            12   <phone type="home">416-555-6121</phone>
            13   <phone type="home">416-555-1943</phone>
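The cardinalities reported in Figure 13.45 can be recomputed from Table 13.8 with a short Python sketch. It only counts distinct key prefixes and makes no claim about how RUNSTATS works internally. In this model, which is an assumption of the sketch, the cid stands in for the docID and the # column for the nodeID.

```python
# One tuple per phone element in Table 13.8: (path, value, cid, seq).
CUST = "/customerinfo/phone"
ASST = "/customerinfo/assistant/phone"
entries = [
    (CUST, "416-555-1358", 1000, 1),
    (CUST, "905-555-7258", 1001, 2),
    (CUST, "905-555-7258", 1002, 3),
    (CUST, "905-555-7258", 1003, 4),
    (CUST, "416-555-2937", 1003, 5),
    (CUST, "905-555-8743", 1003, 6),
    (CUST, "613-555-3278", 1003, 7),
    (CUST, "905-555-4789", 1004, 8),
    (CUST, "416-555-3376", 1004, 9),
    (ASST, "416-555-3426", 1004, 10),
    (CUST, "905-555-9146", 1005, 11),
    (CUST, "416-555-6121", 1005, 12),
    (ASST, "416-555-1943", 1005, 13),
]

def key_cardinalities(index_entries):
    """Return (firstkeycard, first2keycard, first3keycard, first4keycard)."""
    return tuple(len({e[:n] for e in index_entries}) for n in (1, 2, 3, 4))

cust_phone_idx = [e for e in entries if e[0] == CUST]   # /customerinfo/phone
all_phone_idx = entries                                  # //phone

print(key_cardinalities(cust_phone_idx))  # (1, 9, 11, 11)
print(key_cardinalities(all_phone_idx))   # (2, 11, 13, 13)
```

Each cardinality is simply the number of distinct prefixes of the index key, taken one column longer each time.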
The first2keycard of the index custPhoneIdx has the value 9 because out of the eleven index entries there are only nine distinct pathID/value pairs. The phone numbers #2, #3, and #4 have identical values and paths. In other words, three of the six customers have the same work phone number. For the same reason, the first2keycard of the index allPhoneIdx is 11. It contains nine distinct customer phones and two distinct assistant phones. The first3keycard of both indexes equals their fullkeycard because in our example the duplicate phone numbers are all in different documents; that is, they have different docID values in the index.

13.9 SUMMARY

XML indexes are essential to ensure good performance for queries, updates, or delete statements that contain predicates on XML columns. The syntax of the CREATE INDEX statement has been extended to let you specify an XMLPATTERN. An XMLPATTERN is a simple XPath expression that selects the elements or attributes that should be indexed.

XML indexes are different from relational indexes in several ways. An index on a relational column has exactly one index entry for each row in the table, and the data type of the index key is determined by the type of the column. When you define an XML index on a specific XML element, this element may appear zero, one, or multiple times in any given document. Hence, an XML index can have zero, one, or multiple index entries for each row in the table. As a result, an XML index on an optional element that occurs only in very few documents can be very small and efficient.

Another difference from relational indexes is that XML elements and attributes do not necessarily have a predefined data type. Therefore, a target type needs to be specified in the CREATE INDEX statement for an XML index. Since XML indexes only contain values that match the specified XPath and data type, it is not trivial to determine whether an XML index can be used for a given XML predicate or not.
Therefore you have to consider the rules for XML index eligibility when you create indexes and write queries. If an XML index is not used when you think it should be used, a common reason is that the data type of the index is not compatible with the comparison operation in the query predicate. Another common reason is that the XMLPATTERN of the index is more restrictive (selects fewer XML nodes) than the XPath in the predicate. Additionally, XML join predicates require casting to a specific data type before they can use an XML index.
CHAPTER 14 XML Performance and Monitoring

This chapter describes several ways in which you can monitor and analyze the performance of XML operations such as queries, updates, or loading data. We look at query access plans and the statistics you can collect about XML data. Finally, we provide a summary of best practices for XML performance.

Query performance is of particular importance to many applications and is covered in more than one chapter in this book. Guidelines for writing efficient XQuery and SQL/XML queries are provided in Chapters 6 through 9 on querying XML data. Unless explicitly mentioned, the query examples in these chapters reflect best practices for writing queries. The use of XML indexes to improve query performance is discussed in Chapter 13, Defining and Using XML Indexes. Additional guidelines apply to queries and indexes when your XML data contains namespaces. These guidelines are covered in Chapter 15, Managing XML Data with Namespaces.

A common theme for all XML queries is that you might have to examine their execution plan to understand or improve their performance. When you run a query against a DB2 database, DB2 first invokes the query compiler and optimizer to generate an efficient execution plan (also called an access plan) for the query. An execution plan consists of a set of operators that DB2 combines to plan the execution of the query. Then the DB2 run-time engine executes this execution plan. The access plan determines to a large degree how efficiently the query is processed. The DB2 explain facility lets you view the execution plan, which allows you to understand how DB2 executes the query and take corrective measures to improve the access plan, if needed. For example, the execution plan tells you which tables are accessed via an index and which tables are scanned. Table scans can often (but not always) be the reason for suboptimal performance.
Hence, the analysis of an execution plan can prompt you to revisit the usage of XML indexes for specific tables.
Note that the execution plan of a query depends on a variety of factors, including:
• The volume and characteristics of the data in the table, and the statistics collected with the RUNSTATS command/utility
• The existence and characteristics of database objects such as indexes, triggers, constraints, views, and so on
• Database and database manager configuration parameters
• The way the query is written
A change in any of these factors can change the execution plan of the query.

The remainder of this chapter covers the following topics:
• How to obtain and analyze XML query access plans in DB2 for Linux, UNIX, and Windows (section 14.1)
• How to obtain and analyze XML query access plans in DB2 for z/OS (section 14.2)
• How to collect statistics for XML data and indexes (section 14.3)
• How to monitor XML activity (section 14.4)
• A summary of best practices for XML performance in DB2 (section 14.5)

14.1 EXPLAINING XML QUERIES IN DB2 FOR LINUX, UNIX, AND WINDOWS

In this section we describe the DB2 explain facility and how you can use it to understand and improve the performance of XML queries. Sections 14.1.1 through 14.1.3 describe the basic usage of the DB2 explain facility. XML-specific query operators and execution plans are discussed in sections 14.1.4 and 14.1.5, respectively.

14.1.1 The Explain Tables in DB2 for Linux, UNIX, and Windows

Before you can capture explain information you need to create the explain tables. These are relational tables in which DB2 stores the explain information. To display explain information you can either use the command-line tool db2exfmt (“explain format”) or the Visual Explain tool in the DB2 Control Center or IBM Data Studio Developer. These tools transparently read the explain tables as needed. An advantage of the command-line tool db2exfmt is that all explain information for a given query is written to a single output file that you can easily share with others.
For example, the db2exfmt output is the preferred format in which to send explain information to IBM support.
If you use the Visual Explain tool, the explain tables are created for you when you use Visual Explain for the first time. If you use the command-line tool db2exfmt then you need to create the explain tables manually before you use db2exfmt for the first time. The DDL statements that create the explain tables are contained in the file EXPLAIN.DDL, which is located in the directory sqllib\misc. To create the explain tables, go to this directory and issue the following command:

db2 -tf EXPLAIN.DDL

The explain tables are created with a schema of the current DB2 user name, unless you set a specific schema with the SET CURRENT SCHEMA command prior to creating the explain tables. This allows you to control who can use and share the tables.

14.1.2 Using db2exfmt to Obtain Access Plans

In DB2 for Linux, UNIX, and Windows you can use the command-line tool db2exfmt to view access plans. There are several ways of doing this, and we present the most common ways here. It’s a two-step process. The first step is to submit a query and capture its access plan information in the explain tables. The second step is to use the db2exfmt tool to read the explain tables and print the execution plan to a file or to the screen.

First Step: Capture Access Plan Information in the Explain Tables

DB2 has a special register called CURRENT EXPLAIN MODE, which controls the behavior of the explain facility. Its default value is NO, which means that any XQuery or SQL statement executed in the current session is not explained, just executed. You can change the value to YES, to execute an XQuery or SQL query and capture its access plan information in the explain tables. Or you can set it to EXPLAIN to only capture the access plan information in the explain tables without executing the query. Figure 14.1 shows an example of a DB2 Command Line Processor (CLP) session.
Note that most people type lowercase commands in the CLP and the commands are not case sensitive, except XQuery expressions. This session begins with setting the explain mode to EXPLAIN. The subsequent SQL/XML query is not executed and only explained; that is, access plan information is inserted into the explain tables. Then the explain mode is disabled, which allows the second invocation of the SQL/XML query to be executed as usual. Figure 14.1 shows SQL/XML statements but the same works for queries in XQuery notation.
SET CURRENT EXPLAIN MODE explain;
DB20000I The SQL command completed successfully.

SELECT cid FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]');
SQL0217W The statement was not executed as only Explain information requests are being processed. SQLSTATE=01604

SET CURRENT EXPLAIN MODE no;
DB20000I The SQL command completed successfully.

SELECT cid FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]');

CID
------------
        1001
        1002
        1003

  3 record(s) selected.

Figure 14.1 Submitting a query with different explain modes

The SET CURRENT EXPLAIN MODE statement can be embedded in an application program or issued interactively. It’s an executable statement that can be dynamically prepared. Also, you can check the current setting of the explain mode using the command VALUES CURRENT EXPLAIN MODE.

An alternative way to capture access plan information in the explain tables is to start the query with the keywords EXPLAIN ALL FOR as in Figure 14.2. This does not execute the query, only captures its access plan.

EXPLAIN ALL FOR
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]')

Figure 14.2 Explaining an SQL/XML query
The EXPLAIN ALL FOR statement also works for XQuery, but you need to enclose the XQuery in single quotes and specify the xquery keyword, as in Figure 14.3.

EXPLAIN ALL FOR xquery
'for $i in db2-fn:xmlcolumn ("CUSTOMER.INFO")/customerinfo
where $i/phone[. = "905-555-7258" and @type = "work"]
return $i/name'

Figure 14.3 Explaining an XQuery

Second Step: Produce the Execution Plan with db2exfmt

After the explain tables have been populated, run the db2exfmt utility at the OS prompt to read and format this information to produce a human-readable execution plan. You can produce the execution plan for the most recently explained query (that is, the query for which you have most recently captured access plan information) by invoking the utility as follows:

db2exfmt -d sampxml -1 -o myquery_explain.txt

The parameters in the command have the following meaning:
• -d provides the database name.
• -1 (minus one) indicates that you want to produce a plan for the most recently explained query.
• -o (the letter o) provides the optional output file name. If omitted, the output goes to the screen.

It is also possible to produce access plans for previously explained queries. This requires you to use the -w parameter instead of -1, to provide a timestamp of an explained query. This timestamp must match a value in the column EXPLAIN_TIME in the explain table EXPLAIN_STATEMENT. An example of such a command is shown here:

db2exfmt -d sampxml -w 2008-12-08-10.11.16.421002

Before we discuss the db2exfmt output, let’s look at how we can obtain execution plans with Visual Explain. Then we discuss the output of db2exfmt and of Visual Explain together.
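To find a timestamp that you can pass to the -w parameter, you can query the explain tables directly. The following is a minimal sketch, assuming that the explain tables were created in your current schema as described in section 14.1.1. The truncation of the statement text to 60 characters is an arbitrary choice for readability.

```sql
-- List the explained statements, newest first, to find a value for db2exfmt -w.
-- Assumption: the explain tables are in the current schema.
-- EXPLAIN_LEVEL 'O' selects the original statement text as it was submitted.
SELECT explain_time,
       CAST(SUBSTR(statement_text, 1, 60) AS VARCHAR(60)) AS statement
FROM explain_statement
WHERE explain_level = 'O'
ORDER BY explain_time DESC;
```

Each EXPLAIN_TIME value returned by this query identifies one explained statement and can be supplied to db2exfmt with the -w parameter.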
14.1.3 Using Visual Explain to Display Access Plans

In DB2 for Linux, UNIX, and Windows you can invoke Visual Explain from the DB2 Control Center or from IBM Data Studio tools, such as Data Studio Developer and Data Studio Administrator.

To invoke Visual Explain from the DB2 Control Center, right-click on the database name and select Explain Query…. You can then type, paste, or GET the query that you want to analyze. Clicking OK causes the DB2 compiler to generate an execution plan and store it in the explain tables. The graphical access plan is displayed as shown in the center of Table 14.1. You can click on each operator in the plan to see more information about it.

In Data Studio you invoke Visual Explain from the Data Project Explorer pane. In a project connected to a database you should first create an SQL script that contains a query. Then right-click on the query name and select Visual Explain from the context menu, as shown in Figure 14.4. The access plan is then generated and displayed. These steps are the same whether you are connected to DB2 for z/OS or DB2 for Linux, UNIX, and Windows. Data Studio can produce execution plans for both. Right-click on any node in the explain tree for more detailed numbers.

Figure 14.4 Using Visual Explain in Data Studio Developer
Table 14.1 shows the execution plan for the query in Figure 14.2 when no eligible indexes are present. The Control Center, Data Studio Developer, and db2exfmt produce slightly different graphical representations of the same access plan. In general, each item in an explain graph is either an operator or a data object such as a table or an index. Understanding access plans starts with understanding the individual operators, which we discuss in the next section.

Table 14.1 Access Plan from db2exfmt and Visual Explain

Access plan produced by db2exfmt:

Total Cost: 60.5308

            Rows
           RETURN
           (   1)
            Cost
             I/O
             |
          0.06462
           NLJOIN
           (   2)
          60.5308
             8
          /--+-\
         6      0.01077
      TBSCAN     XSCAN
      (   3)     (   4)
      15.1444    7.5644
         2         1
         |
         6
  TABLE: DB2ADMIN
      CUSTOMER

Visual Explain (Control Center):

RETURN(1) 60.53
  NLJOIN(3) 60.53
    TBSCAN(5) 15.14
      DB2ADMIN.CUSTOMER
    XSCAN(7) 7.56

The numbers, such as 15.14 for TBSCAN, indicate the estimated operator cost.

Visual Explain (Data Studio):

(1)RETURN
  (2)NLJOIN 0.0646264
    (3)TBSCAN 6
      (00)CUSTOMER DB2ADMIN 6
    (4)XSCAN 0.0107711

The numbers, such as 6 for TBSCAN, show the estimated number of rows.

14.1.4 Access Plan Operators

An execution plan consists of a set of operators that DB2 combines to execute your query, update, insert, or delete statement. Table 14.2 shows the full list of operators. There are three query operators that process XML documents and indexes, called XSCAN, XISCAN, and XANDOR. Together with the existing operators, they allow DB2 to generate execution plans for XQuery and SQL/XML queries.

Table 14.2 Query Operators in DB2 for Linux, UNIX, and Windows

Operator   Description
DELETE     Deletes rows from a table.
FETCH      Fetches rows from a table.
FILTER     Filters data.
GENROW     Used by DB2 to generate rows of data.
GRPBY      Groups rows.
HSJOIN     Performs a hash join in which the qualified rows from tables are hashed.
INSERT     Inserts rows into a table.
IXAND      The ANDing of the results of multiple index scans.
IXSCAN     Scans or probes an index on relational data.
MSJOIN     Performs a merge-sort join.
NLJOIN     Performs a nested loop join.
RETURN     Returns data from a query.
RIDSCN     Scans a list of row identifiers (RIDs).
RPD        Retrieves data from a non-relational remote data source.
SHIP       Retrieves data from a remote data source.
SORT       Sorts rows or rowIDs from a table.
TBSCAN     Performs a table scan.
TEMP       Stores data in a temporary table.
TQ         A table queue, for parallelization of a query.
UNION      Concatenates streams of rows from multiple tables.
UNIQUE     Eliminates rows with duplicate values.
UPDATE     Updates data in the rows of a table.
XANDOR     Evaluates multiple predicates simultaneously with two or more XISCAN operators.
XISCAN     Scans or probes an index on XML data.
XSCAN      Navigates XML data to evaluate XPath expressions.

We describe the three XML-specific operators here and then look at how they work in an execution plan in the next section.

• XSCAN (XML Document Scan)
DB2 uses the XSCAN operator to traverse XML document trees and, if needed, to evaluate predicates and extract document fragments and values. XSCAN is not an “XML table
scan.” The XSCAN operator typically processes one document at a time. For example, it can appear in an execution plan after a table scan to process each of the documents, or in conjunction with an XML index scan to process the documents identified by the index access.

• XISCAN (XML Index Scan)
Like the existing index scan operator for relational indexes (IXSCAN), the XISCAN operator performs lookups or scans on XML indexes. The XISCAN takes a value predicate as input, which is always a path-value pair such as /book[price = 31] or where $i/book/price = 31. The XISCAN returns a set of row IDs and node IDs. The row IDs identify the rows that contain the qualifying documents, and the node IDs identify the qualifying nodes within these documents. The IDs are typically consumed by other operators, such as a FETCH or a XANDOR, as you will see shortly.

• XANDOR (XML Index ANDing)
The XANDOR operator evaluates two or more equality predicates simultaneously by driving multiple XISCANs. It returns the row IDs of those documents that satisfy all of these predicates. However, DB2 does not use the XANDOR operator for range predicates, or predicates that have a * or // in their XPath. For example, predicates such as //book[price = 31] and /book[price < 50] prohibit the use of the XANDOR operator. In such cases the IXAND operator is used instead. The IXAND operator is also used for relational index ANDing and for exploiting XML and relational indexes at the same time. Whenever possible, avoid * and // in query predicates to allow the DB2 query optimizer to consider the use of the XANDOR operator.

14.1.5 Understanding and Analyzing XML Query Execution Plans

In this section we examine the execution plans for the query in Figure 14.5 when no index, one XML index, or two XML indexes exist that the query can use.
This query contains two predicates, one for the value of the phone element, and one for the value of the type attribute of the phone element.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]')

Figure 14.5 An SQL/XML query with two predicates

Table 14.3 shows the access plan for this query when no suitable indexes are defined on the customer table. The best way to read an access plan is to start at the lowest left-most node in the operator tree, which in this case is the table db2admin.customer. Since no index is available, the customer table is input to the TBSCAN (table scan) operator. The TBSCAN reads all rows from
the table. The NLJOIN (nested loop join) operator connects the TBSCAN with an XSCAN. For each row, the NLJOIN operator passes a pointer to the corresponding XML document to the XSCAN operator. This tells the XSCAN which XML documents to operate on. As such, the NLJOIN does not act as a classical join with two input legs, but facilitates access to the XML data for the XSCAN operator. For each document, the XSCAN operator traverses the document tree, evaluates the two predicates, and extracts the name element if the predicates are satisfied. Each name element is passed up through the NLJOIN operator to the RETURN operator. The RETURN operator returns the result set to the calling application.

Table 14.3 Access Plan for the Query in Figure 14.5 without Index Usage

Access plan produced by db2exfmt:

Total Cost: 60.5308

            Rows
           RETURN
           (   1)
            Cost
             I/O
             |
          0.06462
           NLJOIN
           (   2)
          60.5308
             8
          /--+-\
         6      0.01077
      TBSCAN     XSCAN
      (   3)     (   4)
      15.1444    7.5644
         2         1
         |
         6
  TABLE: DB2ADMIN
      CUSTOMER

Visual Explain:

RETURN(1) 60.53
  NLJOIN(3) 60.53
    TBSCAN(5) 15.14
      DB2ADMIN.CUSTOMER
    XSCAN(7) 7.56

In the db2exfmt output, each operator in the plan is represented by five lines. The RETURN operator at the top serves as a legend and always reminds you what the five lines mean (see Figure 14.6). The number above each operator name is the estimated number of rows produced by the operator. The number in parentheses below the operator name is the operator number within the tree. The next two numbers are the estimated cost of the operation in timerons and the estimated number of I/Os that the operator will perform. Beware that the estimated values are typically not accurate if you haven’t recently used the RUNSTATS command to collect statistics on tables and indexes.
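The per-operator cost estimates that db2exfmt prints can also be read directly from the explain tables, which is convenient for a quick check of which operators a plan contains. The following is a minimal sketch, assuming that the explain tables are in your current schema and that you are interested in the most recently explained statement.

```sql
-- Show the operators of the most recently explained statement
-- together with their estimated total cost and I/O cost.
-- Assumption: the explain tables are in the current schema.
SELECT operator_id,
       operator_type,
       total_cost,
       io_cost
FROM explain_operator
WHERE explain_time = (SELECT MAX(explain_time) FROM explain_operator)
ORDER BY operator_id;
```

For a plan such as the one in Table 14.3 you would expect one row each for the RETURN, NLJOIN, TBSCAN, and XSCAN operators. If an XML index is used, the result should also contain an XISCAN row.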
Estimated number of rows returned by the operator:   Rows      6
Operator name:                                        RETURN    TBSCAN
Unique identifier of the operator in this plan:       (   1)    (   3)
Estimated cost of the operator:                       Cost      15.1444
Estimated I/O cost of the operator:                   I/O       2

Figure 14.6 Operator information in an execution plan

Let’s see how the access plan can change when an XML index is defined on the type attribute, as shown in Figure 14.7. After defining the index, you should invoke the RUNSTATS command to gather index statistics. Alternatively you can add the COLLECT STATISTICS clause to the CREATE INDEX statement.

CREATE INDEX cust_idx1 ON customer(info)
GENERATE KEYS USING XMLPATTERN '/customerinfo/phone/@type'
AS SQL VARCHAR(50)

Figure 14.7 Creating an index for the query in Figure 14.5

Table 14.4 shows the access plan that is obtained after creating the index in Figure 14.7. Again, read the execution plan from the lower-left corner. The XISCAN operator probes the index with the path-value pair (/customerinfo/phone/@type, work) and returns the row IDs for the documents where the phone type is work. These row IDs are sorted to remove duplicates (if any) and to optimize the subsequent I/Os to the table. The RIDSCN operator (row ID scan) then scans these row IDs, triggers row prefetching, and passes the row IDs to the FETCH operator. For each row ID, the FETCH operator reads the corresponding row from the table. The benefit of this plan is that only a fraction of the rows in the table are retrieved; that is, only those where type is work. This is a lot cheaper than a full table scan that reads every row. For each row fetched, the NLJOIN passes a document pointer to the XSCAN operator, which processes the corresponding XML document. It evaluates the predicate on phone and, if the predicate is satisfied, extracts the name element.
There might be many documents where this second predicate is not true, so the XSCAN might still perform a lot of work to eliminate them from the result set. Thus, you might see even better performance if the second predicate is also covered by an index.
Table 14.4 Access Plan for the Query in Figure 14.5 with XML Index Usage

Access plan produced by db2exfmt:

Total Cost: 28.394

               Rows
              RETURN
              (   1)
               Cost
                I/O
                |
             0.212729
              NLJOIN
              (   2)
              28.394
               3.75
              /-+-\
          2.75     0.0773558
         FETCH       XSCAN
         (   3)      (   7)
         7.5921      7.56433
            1           1
       /----+---\
     2.75         6
    RIDSCN   TABLE: DB2ADMIN
    (   4)      CUSTOMER
   0.0267987
       0
       |
     2.75
     SORT
    (   5)
   0.0263833
       0
       |
     2.75
    XISCAN
    (   6)
   0.0250255
       0
       |
       6
 XMLIN: DB2ADMIN
    CUST_IDX1

Visual Explain:

RETURN(1) 28.39
  NLJOIN(3) 28.39
    FETCH(8) 7.59
      RIDSCN(10) 0.03
        SORT(12) 0.03
          XISCAN(14) 0.03
            CUST_IDX1 DB2ADMIN.CUSTOMER
      DB2ADMIN.CUSTOMER
    XSCAN(16) 7.56

Let’s create a second XML index to index the values of the phone element (see Figure 14.8). To provide DB2 with statistics about this new index, either use the COLLECT STATISTICS clause in the index definition, or run RUNSTATS after creating the index.

CREATE INDEX cust_idx2 ON customer(info)
GENERATE KEYS USING XMLPATTERN '/customerinfo/phone'
AS SQL VARCHAR(50) COLLECT STATISTICS

Figure 14.8 Creating a second index for the query in Figure 14.5

An access plan where both XML indexes are used is shown in Table 14.5.
Table 14.5 Access Plan for Query with Two Indexes Defined on the Table

Access plan produced by db2exfmt:

Total Cost: 11.8536

                    Rows
                   RETURN
                   (   1)
                    Cost
                     I/O
                     |
                  0.212729
                   NLJOIN
                   (   2)
                   11.8536
                   1.56019
                   /--+-\
           0.560185     0.379747
             FETCH        XSCAN
            (   3)       (   9)
            4.28921      7.5644
           0.560185         1
          /----+---\
     0.560185        6
      RIDSCN    TABLE: DB2ADMIN
      (   4)       CUSTOMER
     0.0513969
         0
         |
     0.560185
       SORT
      (   5)
     0.0509815
         0
         |
     0.560185
      XANDOR
      (   6)
     0.0500511
         0
    /-----+-----\
  1.22222         2.75
   XISCAN        XISCAN
   (   7)        (   8)
  0.0250255     0.0250255
      0             0
      |             |
      6             6
XMLIN: DB2ADMIN  XMLIN: DB2ADMIN
   CUST_IDX2        CUST_IDX1

Visual Explain:

RETURN(1) 11.85
  NLJOIN(3) 11.85
    FETCH(10) 4.29
      RIDSCN(12) 0.05
        SORT(14) 0.05
          XANDOR(16) 0.05
            XISCAN(18) 0.03
              CUST_IDX2 DB2ADMIN.CUSTOMER
            XISCAN(20) 0.03
              CUST_IDX1 DB2ADMIN.CUSTOMER
      DB2ADMIN.CUSTOMER
    XSCAN(22) 7.56

The execution plan in Table 14.5 contains two XISCAN operators (XML index scans), one for each XML predicate. The XANDOR operator uses these XISCANs to alternately probe into the two indexes to efficiently find the row IDs of the documents that match both predicates. The FETCH operator then only retrieves these rows, thus minimizing I/O to the table. The rest of the query execution works as in the previous plan in Table 14.4. The XANDOR is an XML-specific operator that efficiently computes the intersection between multiple equality predicates.

It is important to understand that the existence of two eligible indexes does not automatically imply that both indexes are always used. After the DB2 compiler has identified that the two
indexes can be used, the DB2 optimizer makes a cost-based decision to determine whether both indexes should be used. For example, the optimizer might choose the plan with one index over the plan with two indexes if the second index does not significantly reduce the number of rows fetched from the table. In such a case the cost of accessing the second index can be greater than the savings in I/O to the table. Based on statistics gathered with the RUNSTATS command, the optimizer tries to detect such cases and uses the execution plan that it deems most efficient.

Remember that DB2 does not use the XANDOR operator if the XPath expressions in the predicates include wildcards (//, *) or if at least one of the indexes evaluates a range comparison (such as > or <). In such cases you will see the IXAND operator (index ANDing) instead of the XANDOR. Logically, both perform the same job but for different types of predicates and with different runtime optimizations.

DB2 can also use the IXAND operator to perform index ANDing across XML and relational indexes. For example, the query in Figure 14.9 is similar to the one in Figure 14.5 and contains the additional relational predicate cid < 1002.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE cid < 1002
AND XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]')

Figure 14.9 Sample query with an additional relational predicate

Table 14.6 shows the execution plan for the query in Figure 14.9 (with some cost estimates removed for brevity). Note the IXAND operator, which computes the intersection of the row IDs produced by one relational index scan (IXSCAN) and two XML index scans (XISCAN). As a result, the FETCH operator reads only those rows from the table that fulfill all three predicates.
Again, whether DB2 decides to use all three indexes or only a subset of them depends, among other things, on the actual data in the table and the statistics that have been collected with RUNSTATS.
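Because the optimizer's choice depends on current statistics, it can be worth refreshing them and explaining the query again before drawing conclusions from a plan. The following is a minimal sketch, assuming the customer table is in the db2admin schema as in the plans above and that the statements are run in a CLP session that uses the semicolon as statement terminator. Note that RUNSTATS is a CLP command rather than an SQL statement.

```sql
-- Refresh the statistics for the table and all of its indexes,
-- including the XML indexes cust_idx1 and cust_idx2.
-- Assumption: the table is qualified with the schema db2admin.
RUNSTATS ON TABLE db2admin.customer WITH DISTRIBUTION AND DETAILED INDEXES ALL;

-- Capture the access plan again without executing the query.
EXPLAIN ALL FOR
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE cid < 1002
AND XMLEXISTS('$INFO/customerinfo/phone[. = "905-555-7258" and @type = "work"]');
```

Afterwards you can format the new plan with db2exfmt and the -1 parameter, as described in section 14.1.2, and compare it with the previous plan.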
Table 14.6 Index ANDing between XML and Relational Indexes

Access plan produced by db2exfmt:

Total Cost: 9.04372

                        Rows
                       RETURN
                       (   1)
                        Cost
                         I/O
                         |
                     0.0709095
                       NLJOIN
                       (   2)
                       /--+-\
               0.186728     0.379747
                 FETCH        XSCAN
                (   3)       (  10)
              /----+---\
         0.186728        6
          RIDSCN    TABLE: DB2ADMIN
          (   4)       CUSTOMER
             |
         0.186728
           SORT
          (   5)
             |
         0.186728
           IXAND
          (   6)
             |
   +---------------+------------+
   2            1.22222        2.75
 IXSCAN         XISCAN        XISCAN
 (   7)         (   8)        (   9)
   |               |             |
   6               6             6
 INDEX:          XMLIN:        XMLIN:
 DB2ADMIN        DB2ADMIN      DB2ADMIN
 PK_CUSTOMER     CUST_IDX2     CUST_IDX1

Visual Explain:

RETURN(1) 9.04
  NLJOIN(3) 9.04
    FETCH(11) 1.48
      RIDSCN(13) 0.07
        SORT(15) 0.07
          IXAND(17) 0.07
            IXSCAN(19) 0.01
              PK_CUSTOMER DB2ADMIN.CUSTOMER
            XISCAN(21) 0.03
              CUST_IDX2 DB2ADMIN.CUSTOMER
            XISCAN(23) 0.03
              CUST_IDX1 DB2ADMIN.CUSTOMER
      DB2ADMIN.CUSTOMER
    XSCAN(25) 7.56

14.2 EXPLAINING XML QUERIES IN DB2 FOR Z/OS

This section describes how to obtain and interpret access plans for XML queries in DB2 for z/OS.

14.2.1 The Explain Tables in DB2 for z/OS

There are four explain tables in DB2 for z/OS (shown in Table 14.7), and they can be created in several ways. You can either use member DSNTESC of the SDSNSAMP library, or you can use the DB2 Administration Tool – option E from the main menu. (Note that the name of the cost estimates table is DSN_STATEMNT_TABLE and not DSN_STATEMENT_TABLE.) Alternatively you can use Visual Explain, which creates the explain tables automatically.
Table 14.7 Explain Tables in DB2 for z/OS

Table Name                  Description
PLAN_TABLE                  Access path information for SQL statements, plans, and packages
DSN_STATEMNT_TABLE          Cost estimates for SQL statements
DSN_STATEMENT_CACHE_TABLE   Statements in the dynamic statement cache
DSN_FUNCTION_TABLE          User-defined functions in SQL statements

14.2.2 Obtaining Access Plan Information in SPUFI

To gather explain information in SPUFI, prefix the query with the keywords EXPLAIN PLAN SET QUERYNO, as shown in Figure 14.10. This assigns a number to the query, and this number is used as a key in the explain tables, which helps you find the access plan information for this specific query in the explain tables.

EXPLAIN PLAN SET QUERYNO = 115 FOR
SELECT XMLQUERY('$i/customerinfo/name' PASSING info AS "i")
FROM customer
WHERE XMLEXISTS('$i/customerinfo/phone[. = "905-555-7258" and @type = "work"]'
PASSING info AS "i")

Figure 14.10 Explaining a query in SPUFI

After running the EXPLAIN statement in Figure 14.10, use the SQL statement in Figure 14.11 to read the explain tables and obtain information about the execution of the query. This query reads the table PLAN_TABLE and selects all rows for the query with a QUERYNO value of 115. The column ACCESSTYPE describes the method that DB2 has chosen to access the customer table. The relevant ACCESSTYPE values for XML queries are
• DX: An XML index scan on the index that is named in ACCESSNAME. It returns a list of document identifiers (DOCIDs).
• DI: An intersection of multiple DOCID lists to return the final DOCID list.
• DU: A union of multiple DOCID lists to return the final DOCID list.
• M: Multiple index scans followed by an intersection or union of RID lists.
SELECT queryno, SUBSTR(tname,1,8) AS tname, accesstype,
       SUBSTR(accessname,1,10) AS accessname, indexonly
FROM plan_table
WHERE queryno = 115 ;

---------+---------+---------+---------+---------+-------
QUERYNO  TNAME     ACCESSTYPE  ACCESSNAME  INDEXONLY
---------+---------+---------+---------+---------+-------
    115  CUSTOMER  DX          CUST_IDX1   N

Figure 14.11 Querying the DB2 for z/OS PLAN_TABLE

Figure 14.12 shows how to examine the DSN_STATEMNT_TABLE table, which contains information about the estimated cost of SQL statements. You should check the value of the column COST_CATEGORY. If the value is B, then one of the possible reasons is that the table cardinality is missing. In this case, ensure that you have run the RUNSTATS utility against both the table space containing the base table and the XML table space.

SELECT queryno, explain_time, stmt_type, cost_category,
       SUBSTR(reason,1,20) AS reason, total_cost
FROM dsn_statemnt_table
WHERE queryno = 115 ;

--------+---------+---------+---------+---------+---------+-----
QUERYNO  EXPLAIN_TIME                STMT_TYPE  COST_CATEGORY
--------+---------+---------+---------+---------+---------+-----
    115  2008-12-03-11.51.08.260000  SELECT     A
. . . . . . . . .
.--+---------+---------+---------+---------+-.
REASON                TOTAL_COST
.--+---------+---------+---------+---------+-.
                      +0.9539343662063281E+02

Figure 14.12 Querying the DB2 for z/OS DSN_STATEMNT_TABLE

14.2.3 Using Visual Explain to Display Access Plans

You can obtain a visual representation of the access plan using either IBM Data Studio Developer or IBM DB2 Optimization Service Center for DB2 for z/OS (OSC). Using Visual Explain in Data Studio Developer is the same in DB2 for z/OS as for DB2 for Linux, UNIX, and Windows and was discussed in section 14.1.3. For DB2 for z/OS you can invoke Visual Explain from the OSC by double-clicking on a project folder and then clicking on Identify Target Query, as shown in Figure 14.13.
Figure 14.13 Invoking Visual Explain from the OSC

The next step is shown in Figure 14.14. Make sure the Query source pull-down list specifies Text, then paste your query into the Query text panel and select Access Plan Graph from the Tools pull-down menu.

Figure 14.14 Invoking Visual Explain from the OSC
The tool generates an access plan, such as the one in Figure 14.16. For each operator node in the access graph, the small number in parentheses to the left of the operator name is the operator ID. The value under the operator name is the cardinality.

14.2.4 Access Plan Operators

An access plan consists of a set of operators that DB2 combines to execute a query. Table 14.8 shows the query operators in DB2 for z/OS. Four query operators are specific to the processing of XML documents and indexes: DIXSCAN, XIXAND, XIXOR, and XIXSCAN.

Table 14.8 Internal Query Operators for DB2 for z/OS

Operator        Description
BTBSCAN         A buffer table scan.
CORSUB ACCESS   Access by a correlated subquery.
DELETE          Deletes selected rows from a table or a deletable view.
DFETCH          An operation called a direct fetch.
DIXSCAN         DOCID index access. Returns a RID for a given DOCID.
EXCEPTA         An EXCEPT ALL operation.
EXCEPT          The EXCEPT operation.
FETCH           DB2 fetches rows from a table using the RIDs from an IXSCAN or MIXSCAN.
FFETCH          Uses a fact table index to fetch fact table data in a pushdown star join.
FIXSCAN         Scans a fact table index during a pushdown star join.
INTERSECTA      The INTERSECT ALL operation.
INTERSECT       The INTERSECT operation.
INSERT          Indicates the insertion of rows into a table or an insertable view.
IXAND           Returns the intersection of two sorted ROWID lists.
IXOR            Returns the union of two sorted ROWID lists.
IXSCAN          A single-index scan.
MERGE           Merges multiple data streams into one data stream.
MIXSCAN         A multiple-index scan.
PARTITION       Separates one data stream into multiple data streams.
414 Chapter 14 Table 14.8 XML Performance and Monitoring Internal Query Operators for DB2 for z/OS (Continued) Operator Description QB n Denotes a query block (subquery), where n is the query block number. RID FETCH RID fetch access. SIXSCAN Sparse index scan. SORTRID Sorts the qualified index entries that result from an index scan. TRUNCATE Truncates a table; that is, deletes all rows. UNION The union of the results from two SELECT statements to form a single result table that contains no duplicate rows. UNIONA The union of the results from two SELECT statements to form a single result table that might contain duplicate rows. UPDATE Updates of one or more columns of the selected rows in a table. WFSCAN Scans a work file. XIXAND Returns the intersection of two sorted DOCID lists. Only those DOCIDs that exist in both DOCID lists are included in the output. XIXOR Returns the union of two sorted DOCID lists. Any DOCID that exists in at least one of the DOCID lists is included in the output. Duplicate DOCIDs are removed. XIXSCAN XML index access, returns the DOCID and NODEID pairs for a given key value. 14.2.5 Understanding and Analyzing XML Query Execution Plans As an example, consider the query in Figure 14.15, which contains two predicates, one for the value of the phone element, and one for the value of the type attribute of the phone element. Let’s examine the access plan for this query when no index, one XML index, or two XML indexes exist that support the predicates in this query. SELECT cid FROM customer WHERE XMLEXISTS('$i/customerinfo/phone[. = "905-555-7258" and @type = "work"]' PASSING info AS "i") Figure 14.15 Sample query with two predicates The left side of Figure 14.16 shows the access plan for this query when there is no eligible index. In this case, DB2 performs a table scan. For each XML document in the info column, DB2 evaluates the XML predicates and returns the value of the cid column if the predicates are true.
The right side of Figure 14.16 shows the access plan for the query in Figure 14.15 when there is an XML index for one of the two predicates in the query. It could be the index on /customerinfo/phone/@type that was defined earlier in Figure 14.7. Read the access plan from the lowest leftmost node in the graph going upwards. The XIXSCAN operator (XML index scan) probes the index CUST_IDX1 with the value "work" and returns the document identifiers (DOCIDs) of all documents that match the predicate. This list of DOCIDs is input to the DIXSCAN operator. For each DOCID, the DIXSCAN operator probes the DOCID index and returns the RID (row identifier) of the corresponding base table row. That is the row that the matching XML document logically belongs to. The FETCH operator uses this RID to fetch the identified row from the customer table.

Left (no XML index):
(1)QUERY
  (2)QB1  384
    (3)TBSCAN  384
      (4)CUSTOMER  768.0

Right (one XML index):
(1)QUERY
  (2)QB1  192
    (3)FETCH  192
      (4)DIXSCAN  192
        (5)XIXSCAN  192
          (6)CUST_IDX  1408
        (7)DOCID Index
      (8)CUSTOMER  768.0

Figure 14.16 Access plan without XML index (left) and with one XML index (right)

When two XML indexes are available, one for the predicate on type and one for the predicate on phone (Figure 14.7 and Figure 14.8), DB2 may generate the access plan shown in Figure 14.17. The plan contains two XIXSCAN operators, one for each predicate and corresponding index. Each XIXSCAN produces a list of DOCIDs for those documents that match the respective predicate. The XIXAND operator (XML index ANDing) computes the intersection of the two DOCID lists that it receives as input. It produces a single DOCID list that contains the DOCIDs of those documents that match both predicates. These DOCIDs are used by the DIXSCAN operator to obtain corresponding base table RIDs from the DOCID index. The MIXSCAN operator indicates that this part of the access plan is a multiple-index access construct. Finally, the FETCH operator uses the generated RIDs to read those rows from the customer table that have XML documents that match both predicates.
(1)QUERY
  (2)QB1  21
    (3)FETCH  21.3333
      (4)MIXSCAN  192
        (5)DIXSCAN  21.3333
          (6)XIXAND  21.3333
            (7)XIXSCAN  21.3333
              (8)CUST_IDX2  1408.0
            (9)XIXSCAN  21.3333
              (10)CUST_IDX2  1408.0
          (11)DOCID Index
      (12)CUSTOMER  768.0

Figure 14.17 Access plan with two XML indexes

Figure 14.18 shows the rows in the PLAN_TABLE table for the same query and access plan as in Figure 14.17. The ACCESSTYPE indicators DX correspond to the XIXSCAN nodes in Figure 14.17. The ACCESSTYPE indicator DI represents the DIXSCAN and M indicates the multiple-index access construct.

SELECT queryno, SUBSTR(tname,1,8) AS tname, accesstype,
       SUBSTR(accessname,1,10) AS accessname, indexonly
FROM plan_table
WHERE queryno = 130 ;
---------+---------+---------+---------+---------+----
QUERYNO  TNAME     ACCESSTYPE  ACCESSNAME  INDEXONLY
---------+---------+---------+---------+---------+----
    130  CUSTOMER  M                       N
    130  CUSTOMER  DX          CUST_IDX2   Y
    130  CUSTOMER  DX          CUST_IDX1   Y
    130  CUSTOMER  DI                      N
DSNE610I NUMBER OF ROWS DISPLAYED IS 4

Figure 14.18 Plan table information for an access plan with two XML indexes
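The XIXAND and XIXOR operators from Table 14.8 can be pictured as single-pass merges over two sorted DOCID lists. The following Python sketch only illustrates that idea; it is not DB2's internal implementation, and the function names and sample DOCID values are hypothetical. It assumes that a DOCID may appear more than once in an input list, because XIXSCAN returns DOCID and NODEID pairs.

```python
def xixand(left, right):
    """Intersect two ascending DOCID lists with a single merge pass."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Keep each DOCID once, even if it repeats in the inputs.
            if not result or result[-1] != left[i]:
                result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def xixor(left, right):
    """Union two ascending DOCID lists, dropping duplicate DOCIDs."""
    result = []
    i = j = 0
    while i < len(left) or j < len(right):
        # Take the smaller head element; fall back to whichever list remains.
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            candidate = left[i]
            i += 1
        else:
            candidate = right[j]
            j += 1
        if not result or result[-1] != candidate:
            result.append(candidate)
    return result


# Hypothetical DOCID lists, as two XIXSCAN operators might produce them.
phone_docids = [3, 7, 7, 12, 20]   # documents matching the phone predicate
type_docids = [1, 7, 12, 15]       # documents matching the @type predicate

both = xixand(phone_docids, type_docids)
either = xixor(phone_docids, type_docids)
print(both)
print(either)
```

In the plan of Figure 14.17, only the intersection is passed on to DIXSCAN, which is why the estimated cardinality above XIXAND can be much smaller than the size of either input list.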
14.3 STATISTICS COLLECTION FOR XML DATA

The access plan that DB2 chooses for a given query is determined, among other things, by the characteristics (statistics) of the data in the table and by the presence of eligible indexes. Statistics can be collected with the RUNSTATS utility, which also collects statistics for XML data.

When you use the RUNSTATS utility you can choose to include or exclude XML data in the statistics collection process. It is generally recommended to include XML to provide the DB2 optimizer with maximum information. If you know that many UPDATE statements have modified the relational columns of your table but not the XML column(s), you might prefer to refresh the statistics for the relational columns only. The RUNSTATS utility typically completes faster if XML columns are excluded.

14.3.1 Statistics Collection for XML Data in DB2 for z/OS

When you use the RUNSTATS utility in DB2 for z/OS, you can specify which table spaces and indexes to include. This gives you the option to run the utility for the base table space only, for the XML table space only, or for both. To ensure the best possible query access plan, it is recommended to run RUNSTATS on the base table space and the XML table space and all related indexes.

Figure 14.19 shows an example of a RUNSTATS job for the customer table. The customer table is in table space DSN00201.CUSTOMER and the associated XML table space is DSN00201.XCUS0000. You can find the name of the XML table space with the REPORT TABLESPACESET utility (see section 3.12, Utilities for XML Objects in DB2 for z/OS). Also, you can use the LISTDEF utility to group these database objects together into a list and then specify that list in the RUNSTATS control statement.
//RUNSTATS EXEC DSNUPROC,PARM='ISC9,PKCTEX',COND=(4,LT)
//SORTLIB  DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT  DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSUT1   DD UNIT=SYSDA,SPACE=(4000,(50,50),,,ROUND)
//SYSIN    DD *
  RUNSTATS TABLESPACE DSN00201.CUSTOMER
    TABLE (ALL) INDEX (ALL) KEYCARD REPORT YES
  RUNSTATS TABLESPACE DSN00201.XCUS0000
    TABLE (ALL) INDEX (ALL) REPORT YES
/*

Figure 14.19 Running the RUNSTATS utility on DB2 for z/OS

When you run the RUNSTATS TABLESPACE utility on an XML table space, the keywords COLGROUP, FREQVAL, and HISTOGRAM are ignored. The RUNSTATS INDEX utility also ignores the keywords KEYCARD, FREQVAL, and HISTOGRAM for XML indexes and NodeID indexes.
Statistics about user-defined XML indexes and the internal DOCID and NodeID indexes are similar to those for relational indexes and are kept in the catalog table SYSIBM.SYSINDEXSTATS. For example, the FIRSTKEYCARD of a user-defined XML index indicates the number of distinct values in the indexed XML element or attribute. The XML index on /customerinfo/phone/@type has a FIRSTKEYCARD value of 3 if the only values that appear in the type attribute are work, home, and cell. The FIRSTKEYCARD of a user-defined XML index can be used as the COLCARD (column cardinality) in the estimation of the filter factor of an XMLEXISTS predicate.

Similarly, the FIRSTKEYCARD of a NodeID index provides the number of distinct DOCID values in the corresponding XML table. These statistics allow DB2 to translate the filter factor of an XMLEXISTS predicate on the internal XML table into a filter factor of the base table. Both filter factors help DB2 to assess the cost of potential access plans and to choose the plan with the lowest estimated cost.

14.3.2 Statistics Collection for XML Data in DB2 for Linux, UNIX, and Windows

In DB2 for Linux, UNIX, and Windows, the default behavior of the RUNSTATS command is to collect statistics for all relational and XML columns of a table. Optionally, you can choose to exclude XML columns and collect statistics for relational columns only. You can also gather statistics for the indexes of a table only, which ignores all XML and relational columns in the table. Yet other options allow you to run RUNSTATS for individual columns or indexes only, by providing the column or index names. The syntax of the RUNSTATS command for some of these options is shown in Figure 14.20.
-- Collect statistics for XML and relational columns:
RUNSTATS ON TABLE db2admin.customer;

-- Collect statistics for relational columns only:
RUNSTATS ON TABLE db2admin.customer EXCLUDING XML COLUMNS;

-- Collect statistics for the XML column “info” only:
RUNSTATS ON TABLE db2admin.customer ON COLUMNS(info);

-- Collect statistics for the XML index “cust_idx1” only:
RUNSTATS ON TABLE db2admin.customer FOR INDEX db2admin.cust_idx1;

-- Collect statistics for XML and relational indexes only:
RUNSTATS ON TABLE db2admin.customer FOR INDEXES ALL;

-- Collect detailed statistics for XML and relational indexes:
RUNSTATS ON TABLE db2admin.customer FOR DETAILED INDEXES ALL;

-- Collect detailed statistics for all columns and all indexes:
RUNSTATS ON TABLE db2admin.customer
  WITH DISTRIBUTION AND DETAILED INDEXES ALL;

Figure 14.20 Useful options of the RUNSTATS command
The last RUNSTATS command in Figure 14.20 collects the most information and gives DB2 the best basis for generating efficient execution plans. But, if you have added a new index to a table with otherwise up-to-date statistics, it is sufficient to collect statistics only for that new index.

For relational data as well as XML data you can enable sampling to reduce the time for executing RUNSTATS. On a large data set, the statistics from 10% of the data (or even less) are often still representative of the total population. Whatever sampling percentage you choose, RUNSTATS allows you to sample rows (Bernoulli sampling) or pages (system sampling). Row-level sampling reads all data pages but considers only a percentage of the rows on each page. Page-level sampling reduces I/O since it reads only a percentage of the data pages. Thus, page sampling can improve performance, especially if XML data is inlined into the data pages of a table.

Figure 14.21 shows examples of RUNSTATS with sampling. The first RUNSTATS command collects detailed index statistics, but for table statistics it samples 10% of the pages. In many cases, this provides the optimizer with reasonably accurate statistics and completes much faster than without sampling. The second command samples 15% of all rows, does not collect distribution statistics, and also applies sampling to the computation of extended index statistics.

RUNSTATS ON TABLE myschema.customer
  WITH DISTRIBUTION AND DETAILED INDEXES ALL
  TABLESAMPLE SYSTEM (10);

RUNSTATS ON TABLE myschema.customer
  AND SAMPLED DETAILED INDEXES ALL
  TABLESAMPLE BERNOULLI (15);

Figure 14.21 RUNSTATS with sampling

14.3.3 Examining XML Statistics with db2cat

In DB2 for Linux, UNIX, and Windows, statistics for XML indexes are very similar to statistics for relational indexes and visible in the catalog views SYSCAT.INDEXES and SYSSTAT.INDEXES. XML index statistics are explained in section 13.8, XML Index Statistics.
In DB2 for Linux, UNIX, and Windows, statistics for XML columns are stored differently from statistics for relational columns. While relational statistics are kept in catalog tables, XML statistics are stored internally in the packed descriptor of the user table that you collect statistics for. As a result, you cannot see XML statistics in any catalog views and you cannot manually modify them. However, the “Database Catalog Analysis and Repair Tool” (db2cat) allows you to read the XML statistics from a table’s packed descriptor and write them to a text file. For example, the following command writes the XML statistics from the customer table in database sampxml and schema db2admin to the file xmlstats.txt:

db2cat -d sampxml -n customer -s db2admin -o xmlstats.txt
In the output file, search for the string XML column statistics to get to the XML statistics. Figure 14.22 shows excerpts from the output file after db2cat was run on the customer table with 6,144 documents and sampled statistics. We simply inserted the six original customer documents 1,024 times. The db2cat output contains six sections:

1. General counters, such as number of documents, paths, and nodes, as well as minimum, maximum, and average document size, and so on
2. Top-k Pathid node counts
3. Top-k Pathid doc counts
4. Top-k Pathid-Value node counts
5. Top-k Pathid-Value doc counts
6. Catch All Pathid-Value Bucket

In these sections Top-k means the top k number of occurrences, where k has a default value of 200. The node count of a given path or path-value pair is the number of nodes that match the path or path-value pair in the XML column. The document count of a path or path-value pair refers to the number of documents that contain the given path or path-value pair at least once.

Examining these statistics does not always lead to immediate actions that you can perform to improve query performance. However, understanding the statistics gives you a glimpse of the information that the DB2 optimizer considers for XML query optimization. It can also reveal characteristics of your XML data that you didn’t know. More specifically, the db2cat output shows the following:

• No. NULL XML docs: The number of NULL values in the XML column.
• No. non-NULL XML docs: The number of XML documents in the XML column.
• No. inlined docs: The number of documents that are inlined into the base table rows when an INLINE LENGTH has been specified for the XML column in the CREATE TABLE statement.
• Distinct Pathid count: The number of distinct paths that occur in the documents in the XML column.
• Sum Node Counts: The total number of nodes (elements, attributes, and so on) across all documents in the column.
• Sum Doc Counts: The sum of the document counts of all distinct paths. For example, suppose the documents contain 10 distinct paths and 4 of these paths occur in 100 documents each, and the other 6 paths occur in 200 documents each. Then Sum Doc Counts has the value 1600; that is (4 × 100) + (6 × 200). As another example, if all distinct paths occur at least once in each document, then Sum Doc Counts = No. non-NULL XML docs × Distinct Pathid count.
Note that if each distinct path occurs exactly once per document then Sum Node Counts equals Sum Doc Counts. If Sum Node Counts is much greater than Sum Doc Counts then this indicates that the documents contain many repeating elements.
• Top-k Pathid node counts: The top k most frequent paths (represented by their path IDs) and the number of nodes that each of these paths identify. For each path, this is the number of times that the path occurs in the XML data and it includes any repeated occurrences within any documents. The value k determines how many frequent paths are included in the statistics, similar to the NUM_FREQVALUES parameter for relational data. The default value for k is 200, but can be changed with the DB2 registry variable DB2_XML_RUNSTATS_PATHID_K. We suggest that you do not change this value unless advised by IBM support.
• Top-k Pathid doc counts: The top k paths that occur in the most documents, together with the number of documents that they occur in. If a path occurs multiple times in a document, then this document is counted only once. As an example, consider the path /customerinfo/phone. Suppose this path appears in the Top-k Pathid doc counts with a document count of 100, and in the Top-k Pathid node counts with a node count of 200. Then DB2 can deduce that each document contains on average two phone elements. This is valuable information, for example, to estimate the cost of navigating a document with a certain path, or to estimate the number of rows produced by an XMLTABLE function.
• Top-k Pathid-Value node counts: The top k most frequent path-value pairs and the number of times they occur. Consider again the path /customerinfo/phone as an example. Since most phone numbers in the customer data are distinct, this path is unlikely to appear among the most frequent path-value pairs. Elements that have a small number of distinct values are more likely to appear.
For example, the path-value pair (/customerinfo/phone/@type, "work") can easily be among the most frequent pairs, because most customers in our sample data have a work phone number.
• Top-k Pathid-Value doc counts: The top k path-value pairs that occur in the most documents, and the number of documents that they occur in. If a path-value pair occurs multiple times in a given document, then this document is counted only once. The value k, which determines how many frequent path-value pairs are collected, is 200 by default but can be changed with the DB2 registry variable DB2_XML_RUNSTATS_PATHVALUE_K.
• Catch All Pathid-Value Bucket: For each distinct path that leads to a text node or attribute, the statistics include the number of distinct values on this path, the second highest and lowest value on this path, as well as the node and document count. Depending on your XML data, this catch-all can contain thousands of entries, which is no reason for concern.
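The relationships between these counters are simple arithmetic, and a small worked example can make them easier to remember. The Python sketch below recomputes the Sum Doc Counts example and the phone example from the list above, plus the ratio of the two sums reported in Figure 14.22. The function names are hypothetical, and the sketch says nothing about how the DB2 optimizer uses the numbers internally.

```python
def sum_doc_counts(doc_counts_per_path):
    """Sum Doc Counts: add up the document count of every distinct path."""
    return sum(doc_counts_per_path)


def average_nodes_per_document(node_count, doc_count):
    """Average number of occurrences of a path in the documents that contain it."""
    if doc_count == 0:
        raise ValueError("the path does not occur in any document")
    return node_count / doc_count


# Ten distinct paths: four occur in 100 documents each, six in 200 documents each.
example_total = sum_doc_counts([100] * 4 + [200] * 6)

# /customerinfo/phone with a node count of 200 and a document count of 100.
phones_per_document = average_nodes_per_document(200, 100)

# Sum Node Counts and Sum Doc Counts as reported in Figure 14.22.
repetition_ratio = average_nodes_per_document(132096, 116736)

print(example_total)
print(phones_per_document)
print(round(repetition_ratio, 2))
```

A ratio of 1 would mean that every path occurs exactly once in each document that contains it. A ratio only slightly above 1, as for the Figure 14.22 values, suggests that the documents contain few repeating elements.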
++++++++++++++++++++++++++++++++++++++++
XML column statistics
++++++++++++++++++++++++++++++++++++++++
Column ID = 1
No. NULL XML docs = 0
No. non-NULL XML docs = 6144
Smallest XML doc size = 422
Largest XML doc size = 678
Avg XML doc size = 544
No. inlined docs = 0
----------------------------------------
Catch All Pathid Bucket
----------------------------------------
Distinct Pathid count = 23
Sum Node Counts = 132096
Sum Doc Counts = 116736
----------------------------------------
Top-k Pathid node counts
----------------------------------------
Max no. of path counts = 23
Cur no. of path counts = 23
Cnt( /root()/customerinfo/phone/text() ) = 11690
Cnt( /root()/customerinfo/phone/type ) = 11228
Cnt( /root()/customerinfo/phone ) = 10951
Cnt( /root()/customerinfo ) = 6144
Cnt( /root()/customerinfo/addr/country ) = 6658
Cnt( /root()/customerinfo/name/text() ) = 6526
Cnt( /root()/customerinfo/addr/pcode-zip/text() ) = 6354
Cnt( /root()/customerinfo/addr/city ) = 6314
Cnt( /root()/customerinfo/addr/street ) = 6248
Cnt( /root()/customerinfo/addr/city/text() ) = 6142
Cnt( /root()/customerinfo/addr/pcode-zip ) = 5931
Cnt( /root()/customerinfo/addr ) = 5905
Cnt( /root()/customerinfo/addr/street/text() ) = 5905
Cnt( /root()/customerinfo/addr/prov-state ) = 5891
Cnt( /root()/customerinfo/Cid ) = 5865
Cnt( /root()/customerinfo/addr/prov-state/text() ) = 5746
Cnt( /root()/customerinfo/name ) = 5733
Cnt( /root()/customerinfo/assistant/name ) = 2338
Cnt( /root()/customerinfo/assistant/name/text() ) = 2140
Cnt( /root()/customerinfo/assistant ) = 2127
Cnt( /root()/customerinfo/assistant/phone/text() ) = 1981
Cnt( /root()/customerinfo/assistant/phone/type ) = 1902
Cnt( /root()/customerinfo/assistant/phone ) = 1797
----------------------------------------
Top-k Pathid doc counts
----------------------------------------
Max no. of path counts = 23
Cur no. of path counts = 23
Cnt( /root()/customerinfo/name/text() ) = 6144
Cnt( /root()/customerinfo/addr/city ) = 6144
Cnt( /root()/customerinfo/phone/type ) = 6144
Cnt( /root()/customerinfo/addr/country ) = 6144
Cnt( /root()/customerinfo ) = 6144
Cnt( /root()/customerinfo/addr ) = 6144
Cnt( /root()/customerinfo/phone ) = 6144
Cnt( /root()/customerinfo/addr/prov-state ) = 6144
Cnt( /root()/customerinfo/addr/pcode-zip ) = 6117
Cnt( /root()/customerinfo/addr/city/text() ) = 6105
Cnt( /root()/customerinfo/name ) = 6094
Cnt( /root()/customerinfo/addr/prov-state/text() ) = 6070
Cnt( /root()/customerinfo/addr/pcode-zip/text() ) = 6035
Cnt( /root()/customerinfo/Cid ) = 5895
Cnt( /root()/customerinfo/addr/street/text() ) = 5872
Cnt( /root()/customerinfo/phone/text() ) = 5813
Cnt( /root()/customerinfo/addr/street ) = 5650
Cnt( /root()/customerinfo/assistant ) = 2370
Cnt( /root()/customerinfo/assistant/name ) = 2171
Cnt( /root()/customerinfo/assistant/phone/text() ) = 2136
Cnt( /root()/customerinfo/assistant/phone ) = 1973
Cnt( /root()/customerinfo/assistant/phone/type ) = 1973
Cnt( /root()/customerinfo/assistant/name/text() ) = 1868
----------------------------------------
Top-k Pathid-Value node counts
----------------------------------------
Max no. of path-value counts = 43
Cur no. of path-value counts = 43
Cnt( /root()/customerinfo/addr/country,6:Canada ) = 6144
Cnt( /root()/customerinfo/phone/type,4:work ) = 6024
Cnt( /root()/customerinfo/addr/city/text(),7:Toronto ) = 3159
Cnt( /root()/customerinfo/phone/type,4:home ) = 3030
Cnt( /root()/customerinfo/addr/city/text(),7:Markham ) = 2041
(...)
----------------------------------------
Top-k Pathid-Value doc counts
----------------------------------------
Max no. of path-value counts = 43
Cnt( /root()/customerinfo/addr/country,6:Canada ) = 6144
Cnt( /root()/customerinfo/phone/type,4:work ) = 6024
Cnt( /root()/customerinfo/addr/city/text(),7:Toronto ) = 3159
Cnt( /root()/customerinfo/phone/type,4:home ) = 3030
Cnt( /root()/customerinfo/assistant/phone/type,4:home ) = 2048
Cnt( /root()/customerinfo/addr/city/text(),7:Markham ) = 2041
(...)
----------------------------------------
Catch All Pathid-Value Bucket
----------------------------------------
Max no. of buckets = 12
Cur no. of buckets = 12
----------------------------------------
PathID = /root()/customerinfo/addr/city/text()
Distinct Value Cnt = 3
2nd Highest Key = 7:Toronto
2nd Lowest Key = 6:Aurora
Sum Node Cnt = 6144
Sum Doc Cnt = 6144
Data Type of Keys = String
----------------------------------------
(...)

Figure 14.22 Output of the db2cat utility
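If you want to compare node counts and document counts for many paths, it can help to read the db2cat text file with a short script. The following Python sketch assumes that the file uses the line shapes visible in the Figure 14.22 excerpt: a section title that starts with Top-k on its own line, followed by lines of the form Cnt( path ) = number. The real file may differ in detail, and the function names are hypothetical. Keep in mind that sampled statistics are estimates, so individual counts may not be exactly consistent with each other.

```python
import re

# Assumed line shape, taken from the Figure 14.22 excerpt: "Cnt( <path> ) = <number>"
CNT_LINE = re.compile(r"Cnt\(\s*(.+?)\s*\)\s*=\s*(\d+)\s*$")


def parse_counts(text):
    """Group Cnt(...) lines under the most recent 'Top-k ...' section title."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Top-k"):
            current = sections.setdefault(line, {})
            continue
        match = CNT_LINE.match(line)
        if match and current is not None:
            current[match.group(1)] = int(match.group(2))
    return sections


def repeating_paths(sections):
    """Paths whose node count exceeds their document count."""
    node_counts = sections.get("Top-k Pathid node counts", {})
    doc_counts = sections.get("Top-k Pathid doc counts", {})
    return sorted(
        path
        for path, nodes in node_counts.items()
        if path in doc_counts and nodes > doc_counts[path]
    )


# A few lines copied from the Figure 14.22 excerpt.
SAMPLE = """\
Top-k Pathid node counts
Cnt( /root()/customerinfo/phone ) = 10951
Cnt( /root()/customerinfo ) = 6144
Top-k Pathid doc counts
Cnt( /root()/customerinfo/phone ) = 6144
Cnt( /root()/customerinfo ) = 6144
"""

stats = parse_counts(SAMPLE)
print(repeating_paths(stats))
```

For these sample lines, only the phone path is reported, which matches the observation that a customer document can contain more than one phone element.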
14.4 MONITORING XML ACTIVITY

Since the pureXML capabilities are deeply integrated into the DB2 engine, the existing tools for monitoring database activity also capture any XML-related activity in the system. The DB2 snapshot monitor, the event monitor, DB2 traces, CLI and JDBC traces, and so on can be used to analyze relational and XML operations. The general usage of these existing tracing and monitoring tools has not changed with the introduction of XML. Therefore we do not discuss them in detail and only look at the snapshot monitor in DB2 for Linux, UNIX, and Windows as an example.

14.4.1 Using the Snapshot Monitor in DB2 for Linux, UNIX, and Windows

Chapter 3, Designing and Managing XML Storage Objects, described the data, index, and XML storage objects in a DB2 table space. These objects are abbreviated DAT, INX, and XDA respectively. The DB2 snapshot monitor counts read and write operations to each of these storage objects separately, and reports them as data, index, and xda page counters. For XML data that is inlined in the base table, XML reads and writes are included in the counters for the DAT object rather than the XDA object. Any activity pertaining to XML indexes, including the regions index, is reflected in the index counters.

The way you capture snapshot information in DB2 has not changed with the introduction of DB2 pureXML. There are seven snapshot monitor switches that you can turn on and off to select the information that is collected. These are listed in Table 14.9. The switches can be set at the application level (per connection) or globally at the database manager (instance) level. The instance-level settings are persistent, but the application-level settings are only effective for the lifetime of a specific connection to the database.
Table 14.9 Snapshot Monitor Switches

Monitor Switch        DBM Parameter
(Application Level)   (Instance Level)     Information Provided
BUFFERPOOL            DFT_MON_BUFPOOL      Buffer pool activity information, number of reads and writes
LOCK                  DFT_MON_LOCK         Lock waits, lock wait time, deadlocks
SORT                  DFT_MON_SORT         Number of sort operations and their behavior
STATEMENT             DFT_MON_STMT         XQuery and SQL statement information
TABLE                 DFT_MON_TABLE        Table activity information
UOW                   DFT_MON_UOW          Unit of work information
TIMESTAMP             DFT_MON_TIMESTAMP    Time and timestamp information
The first command in Figure 14.23 enables the collection of bufferpool, statement, table, and timing information for the current connection. The command disables the collection of lock, sort, and UOW information. These settings remain active until the current connection to the database is terminated. The second command in Figure 14.23 sets the default monitor switches at the DB2 instance level. Any application that connects to a database inherits these settings. You can always check the setting of the monitor switches by using the GET MONITOR SWITCHES command.

UPDATE MONITOR SWITCHES USING BUFFERPOOL on LOCK off SORT off
  STATEMENT on TABLE on UOW off TIMESTAMP on;

UPDATE DBM CFG USING DFT_MON_BUFPOOL on DFT_MON_LOCK off
  DFT_MON_SORT off DFT_MON_STMT on DFT_MON_TABLE on
  DFT_MON_UOW off DFT_MON_TIMESTAMP on;

Figure 14.23 Setting the snapshot monitor switches

After the collection of snapshot information is enabled, you typically use two commands to work with the snapshot monitor:

• RESET MONITOR FOR DATABASE <dbname>
• GET SNAPSHOT FOR [ALL|DATABASE|BUFFERPOOLS|TABLES|TABLESPACES|APPLICATIONS|LOCKS|...]

The RESET MONITOR command resets the monitor counters to zero. You should do this before executing a query or a workload that you want to monitor. After the completion of the workload that you want to monitor, execute the GET SNAPSHOT command to retrieve the collected information. The FOR clause in the GET SNAPSHOT command allows you to obtain snapshot counters for the whole database or by buffer pool, table space, or application. It can also restrict the monitor information to tables, locks, and other areas of interest.

In a partitioned database (DPF), the snapshot monitor commands can optionally use either one of these additional clauses:

• AT DBPARTITIONNUM <db-partition-number>
• GLOBAL

The clause AT DBPARTITIONNUM can be used to specify the database partition for which the command should be executed.
This allows you to update monitor switches or to get and reset monitor information for individual database partitions in a DPF system. Alternatively, use the keyword GLOBAL to affect all partitions. Figure 14.24 provides an example of the GET SNAPSHOT command to display buffer pool information. You see the various snapshot monitor counters for the three different storage objects:
data, index, and XDA. The interpretation of the new XDA counters is the same as the corresponding DAT and INX counters. For example, Buffer pool xda logical reads shows the number of XDA pages that have been requested from the buffer pool. Out of those, some pages were not in the buffer pool and caused physical I/O to the table space. This is reflected in the counter Buffer pool xda physical reads. A low ratio of XDA physical reads to XDA logical reads indicates a high buffer pool hit ratio for XML data, which is desirable. The more XML documents are inlined in the base table, the more XML activity is reflected in the data counters instead of the XDA counters.

GET SNAPSHOT FOR BUFFERPOOLS ON sampxml GLOBAL;
(...)
Buffer pool data logical reads             = 10356
Buffer pool data physical reads            = 216
Buffer pool temporary data logical reads   = 0
Buffer pool temporary data physical reads  = 0
Buffer pool data writes                    = 0
Buffer pool index logical reads            = 7901
Buffer pool index physical reads           = 14
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Buffer pool xda logical reads              = 71993
Buffer pool xda physical reads             = 8414
Buffer pool temporary xda logical reads    = 0
Buffer pool temporary xda physical reads   = 0
Buffer pool xda writes                     = 0
Asynchronous pool data page reads          = 19
Asynchronous pool data page writes         = 0
Buffer pool index writes                   = 0
Asynchronous pool index page reads         = 0
Asynchronous pool index page writes        = 0
Asynchronous pool xda page reads           = 932
Asynchronous pool xda page writes          = 0
(...)

Figure 14.24 Snapshot monitor output for buffer pools

Instead of using the GET SNAPSHOT command you can also run SQL queries against snapshot table functions or administrative views to obtain the same snapshot information. Figure 14.25 shows an example. For further details on the snapshot views and table functions, refer to the DB2 documentation.
SELECT SUBSTR(db_name,1,10) AS dbname,
       SUBSTR(bp_name,1,10) AS bpname,
       pool_xda_l_reads, pool_xda_p_reads, pool_xda_writes,
       pool_async_xda_reads, pool_async_xda_writes,
       pool_async_xda_read_reqs,
       pool_temp_xda_l_reads, pool_temp_xda_p_reads
FROM sysibmadm.snapbp

Figure 14.25 Selecting XDA information from the buffer pool administrative view
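Whether you take the logical and physical read counters from GET SNAPSHOT or from the administrative view, the buffer pool hit ratio mentioned earlier follows from them directly. The Python sketch below applies the common definition, one minus physical reads divided by logical reads, to the data, index, and XDA counters from Figure 14.24. The function name is hypothetical and the sketch is only a worked calculation, not a DB2 interface.

```python
def hit_ratio(logical_reads, physical_reads):
    """Fraction of page requests that were satisfied from the buffer pool."""
    if logical_reads == 0:
        return None  # no page requests, so the ratio is undefined
    return 1.0 - physical_reads / logical_reads


# (logical reads, physical reads) per storage object, copied from Figure 14.24.
counters = {
    "data": (10356, 216),
    "index": (7901, 14),
    "xda": (71993, 8414),
}

ratios = {name: hit_ratio(logical, physical)
          for name, (logical, physical) in counters.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

In this snapshot the XDA hit ratio is noticeably lower than the data and index hit ratios, which is the kind of difference that the separate XDA counters are meant to reveal.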
If you get a snapshot FOR DATABASE or FOR APPLICATION, you will find a snapshot element called Xquery statements executed (see Figure 14.26). This lets you gauge the activity of native XQuery language requests. SQL/XML queries, which may contain XQuery embedded in the functions XMLQUERY, XMLEXISTS and XMLTABLE, are included in the counter Select SQL statements executed.

db2 get snapshot for database on sampxml | grep -i executed
Select SQL statements executed             = 5616
Xquery statements executed                 = 123
Update/Insert/Delete statements executed   = 2513
DDL statements executed                    = 2

Figure 14.26 Number of statement executions

14.4.2 Monitoring Database Utilities

When you use DB2 utilities such as LOAD or REORG to manage XML or relational data, it can be helpful to monitor the progress or status of these utilities. In short, the monitoring of DB2 utilities works for XML just like it does for relational data. There are no special considerations for XML data. For example, in DB2 for z/OS you can use the DISPLAY UTILITY command as usual to check the current status of utilities. In DB2 for Linux, UNIX, and Windows you can use commands such as LIST UTILITIES and LOAD QUERY as you normally do. If you are familiar with these commands, then you can skip this section.

In DB2 for Linux, UNIX, and Windows you can monitor the progress of BACKUP, RESTORE, RUNSTATS, REORG, and LOAD utilities. While any of these utilities are running in the DB2 engine, you can list them with the LIST UTILITIES command. For more detailed information, use the LIST UTILITIES SHOW DETAIL command. The information for each utility can include the start time, description, throttling priority (if applicable), and progress information (if available). You need to be in the SYSADM, SYSCTRL, or SYSMAINT group to execute this command. Alternatively, you can use the administrative views SNAPUTIL and SNAPUTIL_PROGRESS to obtain the same information as from the LIST UTILITIES command.
Sample queries against these views are shown in Figure 14.27.
SELECT utility_type, utility_priority,
       SUBSTR(utility_description, 1, 70) AS utility_description,
       SUBSTR(utility_dbname, 1, 20) AS utility_dbname,
       utility_state, utility_invoker_type, dbpartitionnum
FROM sysibmadm.snaputil;

SELECT utility_id, progress_total_units,
       progress_completed_units, dbpartitionnum
FROM sysibmadm.snaputil_progress;

Figure 14.27 List utility information using an administrative view

Some utilities (LOAD and REORG) have special monitoring capabilities. For example, you can monitor LOAD operations using the LOAD QUERY command:

LOAD QUERY TABLE db2admin.customer

You can check the progress of table reorganizations in the SNAPTAB_REORG administrative view, as shown in Figure 14.28.

SELECT SUBSTR(tabname, 1, 15) AS tab_name,
       SUBSTR(tabschema, 1, 15) AS tab_schema,
       reorg_phase,
       SUBSTR(reorg_type, 1, 20) AS reorg_type,
       reorg_status, reorg_completion, dbpartitionnum
FROM sysibmadm.snaptab_reorg

Figure 14.28 Checking the progress of a REORG using an administrative view

14.5 BEST PRACTICES FOR XML PERFORMANCE

This section summarizes important guidelines for achieving good XML performance in DB2. They are categorized into areas, such as XML Document Design, XML Storage, XML Queries, XML Indexes, XML Updates, XML Schemas, and XML Applications. Many of these best practices are further elaborated on in the relevant chapters in this book.

14.5.1 XML Document Design

Choose an appropriate XML document granularity. If you can influence the design of XML documents, an important decision is how much information to include per XML document. Ideally, one XML document should correspond to one logical business object, such as a purchase order, a product, a sale, a contract, a form, or a tax return. Typically, this design results in XML documents that also match the predominant granularity of read, update,
delete, and insert operations, which is good for performance and simplifies application development. Combining many independent business objects into a single XML document is not usually recommended. For more details, see section 2.3, Choosing the Right Document Granularity.

Design XML documents to serve your application, not to serve DB2. The requirements of your applications should be the main driver for designing XML documents. Design XML documents and XML Schemas such that they are intuitive and easy to use for your applications and application developers. There are no recommended ways to optimize XML documents specifically for DB2’s storage, indexing, and querying capabilities. Instead, follow the general XML design recommendations regarding document granularity and values versus metadata that we described in Chapter 2, Designing XML Data and Applications.

14.5.2 XML Storage

Use a large page size for XML. Processing XML in DB2 typically performs best with large pages. This is why DB2 for z/OS uses a fixed page size of 16KB for XML table spaces and you cannot change it. In DB2 for Linux, UNIX, and Windows you can choose between page sizes 4KB, 8KB, 16KB, and 32KB. Page sizes 16KB or 32KB are usually best for XML data. Page sizes are further discussed in section 3.3.2, Defining Columns, Tables, and Table Spaces for XML Data.

Store XML data in a separate table space, if needed. If your table contains a mix of XML and relational data, and you know that your relational data is best served by a small page size, it can be useful to store XML in a separate table space from the rest of the table. DB2 for z/OS does that by default and you cannot change it. In DB2 for Linux, UNIX, and Windows you can use the LONG IN clause in a CREATE TABLE statement to assign LOB and XML columns to a separate table space. This table space can use a different page size and a separate buffer pool.
Note that inlining XML into the base table prevents storing it in a separate table space. Table spaces for XML data are explained in section 3.3.2, Defining Columns, Tables, and Table Spaces for XML Data, and section 3.11, XML Storage in DB2 for z/OS.

Use DMS table spaces. In DB2 for Linux, UNIX, and Windows you can use SMS or DMS table spaces. The I/O performance with XML data is typically better with DMS than with SMS table spaces. Since version 9.1, DMS table spaces are the default and used by DB2’s automatic storage. Automatic storage is generally recommended. For details, see section 3.3.1, Storage Objects for XML Data.

Use XML inlining wisely. In DB2 for Linux, UNIX, and Windows, inlining XML in the base table can provide several performance benefits. For example:
• Inlining is a prerequisite for XML compression in DB2 9.5, and compression can significantly reduce I/O bottlenecks.
• Inlining enables better prefetching for XML data, especially for queries that scan or access many documents in a column.
• Inlining reduces the size and usage of the regions index.

But, be aware that inlining also increases the row length, and therefore it reduces the number of rows per data page. This means that queries that access relational columns only might perform worse than without inlining. When you use inlining, best performance is often achieved if the temporary table space, such as TEMPSPACE1, has its own dedicated buffer pool. For more information on inlining, please refer to section 3.4, Using XML Base Table Row Storage (Inlining).

14.5.3 XML Queries

In XPath expressions, use fully specified paths as much as possible. Fully qualified XPath expressions, such as /customerinfo/addr/city, provide DB2 with a straight navigation path to the desired data. Expressions like /customerinfo/*/city or //city require DB2 to traverse additional or all branches of an XML document tree, which is more time consuming. If you know the exact paths of the elements or attributes that you are interested in, use the fully specified path and avoid * and // to achieve better performance. An XPath such as //city should really only be used if the city element can occur on multiple different or unknown paths in your documents. Further information is provided in section 6.6, Wildcards and Double Slashes.

Remember that predicates in the XMLQUERY function do not filter rows or use XML indexes. XMLQUERY is a scalar function and is applied to one document at a time. For example, the query:

SELECT XMLQUERY('$INFO/PurchaseOrder/item[price > 10]')
FROM purchaseorder

returns as many rows as there are in the purchaseorder table.
The XMLQUERY function performs filtering only within each document and might produce empty result rows if the predicate is not matched. Put the filtering condition into an XMLEXISTS predicate if you want to eliminate rows from the result and use XML indexes. See section 7.5, Common Mistakes with SQL/XML Predicates, for additional examples.
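The difference between filtering within a document and filtering rows can be mimicked outside DB2. The following Python sketch is only an analogy and not DB2 code: the two sample documents and the helper names matching_items, xmlquery_like, and xmlexists_like are invented for illustration. It shows that a per-document function yields one (possibly empty) result per row, whereas an existence test removes rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the rows of the purchaseorder table.
ROWS = [
    "<PurchaseOrder><item><price>5</price></item></PurchaseOrder>",
    "<PurchaseOrder><item><price>15</price></item>"
    "<item><price>8</price></item></PurchaseOrder>",
]

def matching_items(doc):
    """Return the items with a price > 10, similar to item[price > 10]."""
    root = ET.fromstring(doc)
    return [item for item in root.findall("item")
            if any(float(p.text) > 10 for p in item.findall("price"))]

def xmlquery_like(rows):
    # One result per row, even if that result is empty.
    return [matching_items(doc) for doc in rows]

def xmlexists_like(rows):
    # A row qualifies only if the expression returns a non-empty result.
    return [doc for doc in rows if matching_items(doc)]

query_result = xmlquery_like(ROWS)
exists_result = xmlexists_like(ROWS)
print(len(query_result), [len(r) for r in query_result])
print(len(exists_result))
```

In this sketch the first call returns two results, one of them empty, while the second call returns only the one document that contains a qualifying item.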
Use square brackets in XMLEXISTS predicates. Remember that an XMLEXISTS predicate eliminates a row from the result set only if the embedded XQuery or XPath expression returns an empty result. Hence, the embedded search condition must be enclosed in square brackets, like this:

WHERE XMLEXISTS('$INFO/PurchaseOrder/item[price > 10]')

Without the square brackets the enclosed expression would always produce a Boolean value true or false. It would never produce an empty result, never filter any rows, and never use an XML index. Section 7.4, Using XPath Predicates in SQL/XML with XMLEXISTS, provides further details and examples.

Write proper “between” predicates. XQuery does not have a BETWEEN keyword like SQL does. Hence, “between” predicates must be written as a pair of range predicates. A pair of range predicates is best written as /purchaseorder/item/price[. > 10 and . < 20] rather than for example /purchaseorder[item/price > 10 and item/price < 20]. The notation with the dots (current context) ensures that you get the desired result and a better execution plan. For details on this execution plan, see section 9.4.2, “Between” Predicates on XML Data.

In a DPF database, choose relational join predicates over XML join predicates, if possible. This guideline applies to partitioned databases in DB2 for Linux, UNIX, and Windows. Queries that join two or more tables in a partitioned database can often use better execution plans if the join condition is expressed on relational columns rather than XML columns. One example is that relational join conditions allow DB2 to recognize collocated joins. Collocated joins have the property that matching rows from two tables reside at the same node (partition) of the partitioned database. This allows the join to be computed within each node, without shipping rows between nodes, which is desirable for performance.
Another advantage is that DB2 has a greater choice of join methods and repartitioning options for relational join predicates than for XML join predicates. Background on DPF is provided in section 3.10, XML in a Partitioned Database (DPF). Cast XML join predicates to an appropriate XML data type to allow XML index usage. Join predicates do not contain a literal value that DB2 can use to determine the data type of the comparison. Hence, DB2 needs to look for join matches in any possible data type. This precludes the use of XML indexes because they contain values for a single data type only. If you cast the join predicate to a certain data type, then DB2 can consider using a corresponding index. For example, the following join predicate allows the use of an XML index of type DOUBLE:
XMLEXISTS('$BOOKINFO/book/authors[author/@id/xs:double(.) =
           $AUTHORINFO/author/@id/xs:double(.)]')

Joins are discussed in section 9.2, Join Queries with XML Data, and section 13.4, XML Indexes and Join Predicates.

Use RUNSTATS for XML data and indexes. Efficient query execution plans depend on up-to-date statistics about the data and objects in the database. This is true for XML and relational queries alike. Refresh statistics whenever significant amounts of data have been inserted, deleted, or updated. See section 14.3, Statistics Collection for XML Data.

Increase the statement heap if you have very complex XML queries. In DB2 for Linux, UNIX, and Windows, the size of the statement heap determines how much memory DB2 can use for compiling and optimizing a query. When DB2 optimizes a query, it generates a set of candidate execution plans, estimates the execution cost for each plan, and chooses the plan with the lowest cost. For complex queries, DB2 might have to consider a large number of candidate plans. DB2 issues warning SQL0437W (reason code 1) if the statement heap is too small to consider a sufficient number of candidate plans. As a result, a suboptimal execution plan may be used. To solve this problem, increase the size of the statement heap (database configuration parameter stmtheap) or reduce the optimization level (database configuration parameter dft_queryopt). Appendix C contains links to general DB2 material where these parameters are explained, such as the DB2 Information Center.

14.5.4 XML Indexes

Use fully specified path expressions in XML index definitions. This recommendation is similar to XML queries. Fully specified path expressions in index definitions allow DB2 to perform more efficient index maintenance than index definitions that include * or // in the XMLPATTERN.
Additionally, fully specified paths ensure that you index only as many XML nodes as needed, which is good for performance and avoids unnecessarily large indexes. For examples, see section 13.1.2, Lean XML Indexes. Use appropriate namespace declarations in XML index definitions. If your XML data contains namespaces then XML index definitions need to contain either corresponding namespace declarations or namespace wildcards, which match any namespace. Otherwise the index does not contain the desired index entries and is not used for query processing. Details can be found in section 15.4, Creating Indexes for XML Data with Namespaces.
Use VARCHAR HASHED indexes to your advantage. DB2 for Linux, UNIX, and Windows supports XML indexes of type VARCHAR HASHED, which can index strings of arbitrary length. Each index key is an 8-byte hash code of the indexed string rather than the string value itself. This can save a lot of space if the indexed string values tend to be long. For example, if you index an element <url> that contains URLs with an average length of 80 characters, a VARCHAR HASHED index uses keys that are 10 times smaller than those of a VARCHAR(n) index. A VARCHAR HASHED index can only be used to evaluate equality predicates, which, depending on the workload, may be sufficient. More information is available in section 13.2, XML Index Data Types.

Be aware of the overhead of reorganizing XML indexes. Reorganizing XML indexes is more expensive than reorganizing relational indexes. Hence, the time to reorganize all indexes for a table can be significantly increased if the set of indexes includes XML indexes. Therefore you might find it advantageous to reorganize relational and XML indexes separately and to explicitly refer to their index names in the REORG command. Since version 9.7, DB2 for Linux, UNIX, and Windows also offers online index reorganization, which keeps the table fully available during index reorganization and does not require a dedicated maintenance window. Also see section 3.7, Reorganizing XML Data and Indexes.

At the time of writing, the DB2 design advisor (db2advis) does not yet recommend XML indexes.

14.5.5 XML Updates

Combine multiple update operations in a single statement. If you need to make multiple modifications to a given XML document, combine multiple update operations in a single UPDATE statement rather than issuing multiple UPDATE statements. See Chapter 12, Updating and Transforming XML Documents, for examples.

For best update performance, choose a small XML document granularity.
Modifications within an XML document tend to perform better for smaller documents (in the KB range) rather than larger documents (in the MB range). Therefore, the previous guideline for XML document granularity (section 14.5.1) is of particular importance. If best possible update performance is critical for your application, try using a smaller document granularity. Consider splitting large documents into multiple smaller documents upon insert, as discussed in section 5.7, Splitting Large XML Documents into Smaller Documents.
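To make the idea of a smaller document granularity concrete, here is a minimal Python sketch that splits one large document into one document per business object before the documents are inserted. It is an application-side illustration under stated assumptions and not the technique of section 5.7: the element names orders and order, the sample data, and the helper split_document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical large document that bundles several independent orders.
LARGE_DOC = """<orders>
  <order id="1"><item>Widget</item></order>
  <order id="2"><item>Gadget</item></order>
  <order id="3"><item>Gizmo</item></order>
</orders>"""

def split_document(xml_text, child_tag):
    """Return one serialized document per child element named child_tag."""
    root = ET.fromstring(xml_text)
    # strip() removes the whitespace that followed each child in the parent.
    return [ET.tostring(child, encoding="unicode").strip()
            for child in root.findall(child_tag)]

small_docs = split_document(LARGE_DOC, "order")
for doc in small_docs:
    print(doc)
```

Each resulting string could then be inserted as a separate row, so that a later update touches only one small document.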
14.5.6 XML Schemas

Avoid repetitive document validation. Inserting or updating XML documents with validation consumes more CPU cycles than the same operations without validation. The difference is small for simple XML Schemas, but can be significant for large and complex XML Schemas. The validation of XML documents can happen at various layers in the IT stack. For example, incoming XML messages might get validated by the enterprise service bus or application server. In this case, additional validation by the DB2 server might not be necessary. Also, if XML documents are produced by a trusted application and you know that the documents are always valid, XML inserts without validation can reduce the CPU utilization on the database server. Further details are discussed in section 16.1.2, To Validate or Not to Validate, That Is the Question! as well as Chapter 17, Validating XML Documents against XML Schemas.

14.5.7 XML Applications

Use pureXML instead of XML parsing in the application. Traditionally, XML applications perform a lot of XML parsing to manipulate XML documents. For example, updating or extracting values from XML documents is commonly done with application-level XML parsing. Much of this XML parsing can be avoided by using DB2 pureXML. DB2 stores XML in a parsed format, which allows value extractions, updates, and other operations to be performed without XML parsing. This allows for simpler application code and high end-to-end performance. For more information, see section 21.1.1, Avoid XML Parsing in the Application Layer.

For short queries or OLTP applications, use SQL/XML statements with parameter markers or host variables. Very short database queries often execute so fast that the time to compile and optimize them is a substantial portion of their total response time. Thus, it’s useful to compile them just once and only pass literal values for each execution.
The SQL/XML functions XMLQUERY, XMLTABLE, and XMLEXISTS allow you to pass SQL parameter markers or host variables as XQuery variables into the embedded XPath or XQuery expressions. Then you prepare or pre-compile the SQL/XML statements just like regular SQL statements. This is recommended for applications with short and repetitive queries. The same applies to short insert, update, and delete operations. You can see examples in section 21.2, Using Parameter Markers or Host Variables. Avoid code page conversion during XML insert and retrieval. Code page conversion can be an expensive operation. If the code page of the application is different from the code page of the database, then any character data that is passed between the database and the application undergoes code page conversion. This transcoding can be avoided if
applications move XML data to and from the database in binary format, using binary type variables rather than character type variables. For example, in CLI when you use SQLBindParameter() to bind parameter markers to input data buffers, you should use SQL_C_BINARY data buffers rather than SQL_C_CHAR or SQL_C_WCHAR. Similarly, when you insert XML data from a Java application, provide the XML data as a binary stream (setBinaryStream) rather than as a string (setString). Further details on code page implications can be found in section 20.2, Avoiding Code Page Conversions.

For large result sets, use XMLSERIALIZE to exploit blocking cursors or LOB locators. When an application retrieves a relational result set without XML columns, a blocking cursor can be used for more efficient data transfer from the database server to the client. With a blocking cursor, a block of result rows is transferred to the client in a single operation, which is more efficient than transferring one row at a time. By default, blocking cursors cannot be used when retrieving XML type data. However, you can use the XMLSERIALIZE function to explicitly convert an XML type column in the result to a VARCHAR column, which allows blocking. This can improve the performance of queries that retrieve many small XML values, such as the following:

SELECT XMLSERIALIZE(XMLQUERY('$INFO/customerinfo/name')
       AS VARCHAR(100))
FROM customer

This approach works as long as the serialized XML data fits within the specified VARCHAR length, which cannot exceed 32KB. The benefits of a blocking cursor can outweigh the cost of code page conversion that may occur when retrieving XML data as a VARCHAR column. If you retrieve very large XML documents from a database, XMLSERIALIZE … AS CLOB or AS BLOB enables you to use LOB locators. XMLSERIALIZE is further discussed in section 4.3, Retrieving XML Documents, and section 4.4, Handling Documents with XML Declarations.
14.6 SUMMARY

DB2’s pureXML capabilities are tightly integrated with all of its relational features in the DB2 engine. Therefore, all existing performance guidelines for DB2 still apply when the database contains XML data. XML does not introduce a departure from the existing best practices for configuring, tuning, and monitoring DB2. If you are a DBA who is new to managing XML data in the database, you can safely apply your experience and knowledge with relational data to the management of XML data. That’s always a good start. Additionally, a set of XML-specific performance tips is summarized in the previous section.

One of the key factors for XML query performance is the proper use of XML indexes. Hence, DB2’s features for examining access plans are critical to check which indexes are or are not used by a given query. Fortunately, all of DB2’s explain tools work for XML queries just as they do for relational queries. To help DB2 produce good access plans, use RUNSTATS to collect statistics for your XML data.
When you examine the access plans of XML queries, you will encounter several XML-specific query operators, which DB2 uses in conjunction with relational query operators. In DB2 for Linux, UNIX, and Windows, the XML operators are called XSCAN, XISCAN, and XANDOR. They facilitate the traversal of XML documents (XSCAN) as well as the access (XISCAN) and join (XANDOR) of XML indexes. In DB2 for z/OS, the XML operators are DIXSCAN, XIXAND, XIXOR, and XIXSCAN. The XIXSCAN operator performs access to an XML index and produces document identifiers (DOCIDs). The XIXAND and XIXOR operators are used to compute the intersection or union, respectively, of sets of such DOCIDs. The DIXSCAN operator takes a DOCID as input and returns the RID of the row that a document belongs to. Examining and understanding XML query execution plans is one of the most important parts of investigating XML query performance.
CHAPTER 15

Managing XML Data with Namespaces

In the previous chapters you have learned how to store, index, query, construct, and update XML documents in DB2. In those discussions we have assumed that the XML documents do not contain namespaces. When XML namespaces are present, additional considerations are required, which are described in this chapter. The discussion of namespaces is split into the following topics:

• Understanding namespaces and namespace declarations (section 15.1)
• Obtaining namespace information from XML documents (section 15.2)
• Writing XQuery and SQL/XML queries for XML data with namespaces (section 15.3)
• Defining and using XML indexes in the presence of namespaces (section 15.4)
• Generating XML documents with namespaces (section 15.5)
• Updating XML documents that contain namespaces (section 15.6)

15.1 INTRODUCTION TO XML NAMESPACES

XML namespaces are a W3C XML standard for providing uniquely named elements and attributes in an XML document. XML documents can contain elements and attributes that have the same name but belong to different vocabularies (or domains). Such name conflicts can lead to ambiguity when you use these elements and attributes. This ambiguity is resolved by assigning a namespace to each vocabulary. All pureXML features in DB2 such as SQL/XML, XQuery, XML indexes, and validation with XML Schemas support XML namespaces.
As an example, consider the three XML elements in Figure 15.1. They all have the same element name, title, but they probably have different meanings. The first element might be a job title, the second element might be the title of a person, and the third could be the title of a book or a movie. If an application is not able to distinguish between them and treats them all in the same way, then processing errors or logically incorrect results are the likely consequence.

<title>Database Administrator</title>
<title>Mr</title>
<title>Lord of the Rings</title>

Figure 15.1 Three elements with identical names

Figure 15.2 shows the same three elements where namespace prefixes are used to avoid naming collisions and prevent ambiguity. The prefix is separated from the original element name (which is also called the local name) by a colon. The prefix has to appear in the start tag and the end tag of an element. The namespace prefixes indicate that the three elements belong to different domains.

<job:title>Database Administrator</job:title>
<person:title>Mr</person:title>
<movies:title>Lord of the Rings</movies:title>

Figure 15.2 Three elements with identical local names but distinct prefixes

As an analogy, think of relational schema names that can be used to qualify the names of relational tables or indexes. For example, in a DB2 database you can have two tables called schema1.mytable and schema2.mytable. The schema names act as prefixes for the table names and avoid ambiguity and name conflicts. To drive the analogy further, remember that a relational schema name is typically used to group together multiple database objects, such as tables and indexes, if they belong to the same application or logical domain. Similarly, a namespace is used to identify and group all XML tag names that belong to the same application domain.
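The effect of the prefixes in Figure 15.2 can be tried out with any namespace-aware XML parser. The following Python sketch uses the standard xml.etree.ElementTree module rather than DB2. The wrapper element titles and the three example.org URIs are assumptions added for illustration, because the prefixes must be bound to URIs before a parser accepts them.

```python
import xml.etree.ElementTree as ET

# The three title elements of Figure 15.2 inside a hypothetical wrapper
# element that binds each prefix to an invented namespace URI.
DOC = """<titles xmlns:job="http://example.org/job"
        xmlns:person="http://example.org/person"
        xmlns:movies="http://example.org/movies">
  <job:title>Database Administrator</job:title>
  <person:title>Mr</person:title>
  <movies:title>Lord of the Rings</movies:title>
</titles>"""

root = ET.fromstring(DOC)

# ElementTree reports each element name as {namespace URI}local-name.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

# Only the job title matches a lookup in the job namespace.
job_titles = root.findall("{http://example.org/job}title")
print([t.text for t in job_titles])
```

Although all three elements share the local name title, the parser reports three different names, so an application can tell them apart.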
Typically, an XML Schema defines the element and attribute names that can appear in XML documents for a given application. Such an XML Schema often declares a so-called target namespace, which means that all elements (and optionally also all attributes) that are defined in the schema belong to this specific namespace. XML Schemas and their namespaces are explained in Chapter 16, Managing XML Schemas.

There is more to namespaces than the prefixes for tag names. In particular, a namespace needs to be declared and a prefix has to be assigned to a Uniform Resource Identifier (URI) before it can
be used. An XML namespace is identified by a URI, and a namespace prefix only acts as an abbreviation or alias for the URI. The following strings are examples of URIs:

• http://www.DB2pureXML-Cookbook.org/
• ftp://ftp.is.co.za/rfc/rfc3986.txt
• urn:xmlns:bogus:partner1.0
• telnet://192.0.2.16:80/

URIs often have the style of Uniform Resource Locators (URLs) or Uniform Resource Names (URNs). However, a URI does not retrieve data from the specified location. If a URI has the form of a URL, the URL does not need to reference a real web page; it can be a “fake” URL that simply serves as an identifier. Namespace URIs and namespace prefixes are case sensitive and must not contain spaces. Appendix C, Further Reading, contains pointers to more details on URIs.

There is no XML well-formedness rule that requires namespace names to be URIs. For example, URIs that contain spaces (blanks) are invalid URIs but they do not affect the well-formedness of an XML document. Therefore, you can insert XML documents that have spaces in their namespace URIs. However, URIs with spaces cannot be declared in a query. This makes it difficult and sometimes impossible to query documents with such invalid URIs. Also, it is not possible to refer to URIs with spaces in an xsi:schemaLocation attribute because the spaces would be interpreted as delimiters for the URIs in the list. In short, you should never use spaces in namespace URIs.

15.1.1 Namespace Declarations in XML Documents

XML namespaces are declared in XML documents with the reserved attribute xmlns:prefix where prefix is the namespace prefix that you want to use. The value of this reserved attribute must be a URI. Figure 15.3 shows an XML document whose root element customerinfo contains a namespace declaration. You will later see that namespaces can be declared in any element, not just the root element.
The attribute xmlns:cust declares that cust is a namespace prefix and bound to the URI http://posample.org. The prefix cust can therefore be used for the customerinfo element and all other elements or attributes in the document that are descendants of customerinfo. In Figure 15.3, all elements that carry the prefix cust belong to the namespace http://posample.org.

<cust:customerinfo xmlns:cust="http://posample.org" Cid="1000">
  <cust:name>Kathy Smith</cust:name>
  <cust:addr country="Canada">
    <cust:street>5 Rosewood</cust:street>
    <cust:city>Toronto</cust:city>
  </cust:addr>
  <cust:phone type="work">416-555-1358</cust:phone>
</cust:customerinfo>

Figure 15.3 Document with namespace and prefix declaration
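As a quick check of what the declaration in Figure 15.3 means to a namespace-aware parser, the following Python sketch loads the document with the standard xml.etree.ElementTree module. This is not DB2 code, and the prefix c used for the lookup is an arbitrary choice made here to show that only the URI matters, not the prefix spelled in the document.

```python
import xml.etree.ElementTree as ET

# The document from Figure 15.3.
DOC = """<cust:customerinfo xmlns:cust="http://posample.org" Cid="1000">
  <cust:name>Kathy Smith</cust:name>
  <cust:addr country="Canada">
    <cust:street>5 Rosewood</cust:street>
    <cust:city>Toronto</cust:city>
  </cust:addr>
  <cust:phone type="work">416-555-1358</cust:phone>
</cust:customerinfo>"""

root = ET.fromstring(DOC)

# The parser replaces the prefix with the URI it is bound to.
print(root.tag)

# A lookup may use a different prefix as long as it maps to the same URI.
ns = {"c": "http://posample.org"}
city = root.find("c:addr/c:city", ns)
print(city.text)

# A path without any namespace does not match the namespaced elements.
no_match = root.find("addr/city")
print(no_match)
```

The last lookup finds nothing because the elements addr and city exist only in the namespace http://posample.org.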
The attributes Cid and country in Figure 15.3 do not belong to any namespace because they do not have a prefix and do not inherit the namespace of the element that they belong to. Since an attribute always belongs to a specific element and cannot occur by itself, the namespace of the attribute’s element is sufficient to avoid attribute ambiguity. Therefore attributes typically do not need to be in a namespace and do not require prefixes. If you want to assign an attribute to a namespace you must add a prefix to the attribute, like this:

<cust:addr cust:country="Canada">

Not every node in a document has to belong to the same namespace. For example, the document in Figure 15.4 uses the namespace prefix c only for the elements customerinfo, name, and phone. These elements belong to the namespace http://posample.org while the elements addr, street, and city as well as the attributes Cid and country do not belong to any namespace.

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
  </addr>
  <c:phone type="work">416-555-1358</c:phone>
</c:customerinfo>

Figure 15.4 Document in which some elements do not have a namespace

An XML document can contain multiple namespaces, as shown in Figure 15.5. This document has a second namespace declaration located in the addr element. It assigns the prefix add to the URI http://myAddresses.org. The addr element itself, as well as any descendant nodes under addr, can use the prefix add to indicate that it belongs to the same namespace. However, the elements customerinfo, name, and phone cannot use the prefix add because they are not descendants of the addr element and therefore not in the scope of the namespace declaration that defines the prefix add. Namespace prefixes can only be used in the subtree of the document for which they are declared.
For example, if you used the prefix add for the customerinfo element in Figure 15.5, DB2 would reject the document upon insert with the error SQL16193N.

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <add:addr xmlns:add="http://myAddresses.org" country="Canada">
    <add:street>5 Rosewood</add:street>
    <add:city>Toronto</add:city>
  </add:addr>
  <c:phone type="work">416-555-1358</c:phone>
</c:customerinfo>

Figure 15.5 Document with multiple namespaces
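The scoping rule is enforced by namespace-aware parsers in general, not only by DB2. The following Python sketch uses the standard xml.etree.ElementTree module to show it. The shortened documents and the helper is_accepted are assumptions made for illustration, and the Python parser reports its own ParseError rather than the DB2 error code mentioned above.

```python
import xml.etree.ElementTree as ET

# Shortened variant of Figure 15.5: add is used only within addr.
IN_SCOPE = """<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <add:addr xmlns:add="http://myAddresses.org" country="Canada">
    <add:city>Toronto</add:city>
  </add:addr>
</c:customerinfo>"""

# Same content, but the root element uses the prefix add, which is
# declared only further down on the addr element.
OUT_OF_SCOPE = """<add:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <add:addr xmlns:add="http://myAddresses.org" country="Canada">
    <add:city>Toronto</add:city>
  </add:addr>
</add:customerinfo>"""

def is_accepted(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_accepted(IN_SCOPE))
print(is_accepted(OUT_OF_SCOPE))
```

The second document is rejected because the prefix add is not bound to any URI at the point where the root element uses it.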
If a document contains multiple namespaces, their prefixes must be distinct. Namespaces can also be interleaved within a document, as in Figure 15.6. The namespace prefix c can be used for the elements street and city, because they are in the scope of the namespace declaration for the prefix c.

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <add:addr xmlns:add="http://myAddresses.org" country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
  </add:addr>
  <c:phone type="work">416-555-1358</c:phone>
</c:customerinfo>

Figure 15.6 Document with interleaved use of namespaces

Since XML has traditionally been used in web-based applications, most people use URLs or URNs to identify their namespaces. But there is nothing to stop you from using any other form of string to name a namespace. For example, the following is a well-formed XML document:

<hh:customer xmlns:hh="HappyHolidays"></hh:customer>

The full name (also called the expanded name) of any element or attribute consists of two parts, the local name and the namespace name. In the preceding example, the local name of the element is customer and the namespace name is HappyHolidays. The notation hh:customer represents the full name where hh is a reference to HappyHolidays. When a namespace-compliant XML processor, such as DB2, evaluates XPath expressions over XML data, elements and attributes are always identified by their full name and never by their local name alone. Full names are used even when no namespaces are declared. In this case, the namespace part of a full name is empty and cannot match a node name whose namespace is not empty.

Note that an attribute can be in a different namespace than the element it belongs to. In Figure 15.7, the attribute country belongs to the namespace http://custinfo.org, whereas the element addr belongs to the namespace http://addr.org.
<c:customerinfo xmlns:c="http://custinfo.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <a:addr xmlns:a="http://addr.org" c:country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
  </a:addr>
  <c:phone type="work">416-555-1358</c:phone>
</c:customerinfo>

Figure 15.7 Attribute and element in different namespaces
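The namespaces of the nodes in Figure 15.7 can be inspected with a namespace-aware parser. The following Python sketch uses the standard xml.etree.ElementTree module, which is an assumption of this illustration and not part of DB2. It reports attribute names in the same {URI}local-name notation as element names.

```python
import xml.etree.ElementTree as ET

# The document from Figure 15.7.
DOC = """<c:customerinfo xmlns:c="http://custinfo.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <a:addr xmlns:a="http://addr.org" c:country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
  </a:addr>
  <c:phone type="work">416-555-1358</c:phone>
</c:customerinfo>"""

root = ET.fromstring(DOC)
addr = root.find("{http://addr.org}addr")

# The element and its attribute are in two different namespaces.
print(addr.tag)
print(addr.attrib)

# The unprefixed attribute Cid is in no namespace at all.
print(root.attrib)

# street has no prefix and no default namespace applies, so it is
# found by its local name alone.
street = addr.find("street")
print(street.tag)
```

Looking up the attribute as plain country returns nothing here, because its full name includes the namespace http://custinfo.org.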
15.1.2 Default Namespaces

If all elements in a document belong to the same namespace, it is often useful to declare a default namespace and avoid the use of prefixes. The namespace declared in Figure 15.8 is a default namespace because the xmlns attribute does not assign a prefix to the URI. The default namespace applies to all elements in its scope. In Figure 15.8, the scope of the default namespace is the entire document. The default namespace does not apply to attributes. The attributes Cid and country do not belong to any namespace, only their respective elements do.

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.8 Document with a default namespace

You can also think of a default namespace as a namespace with an empty prefix. Every element that has an empty prefix belongs to the default namespace.

A default namespace can be overridden by another default namespace at a lower level of the document. In the document in Figure 15.9, the addr element and all of its descendant elements (street and city) belong to the default namespace http://myAddresses.org, which overrides the namespace http://posample.org. The phone element is a child of customerinfo and belongs to the namespace http://posample.org.

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr xmlns="http://myAddresses.org" country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.9 Document with two default namespaces

XML documents can contain a mix of default and prefix namespaces. In Figure 15.10, all elements that do not have a prefix belong to the default namespace http://posample.org. The elements addr and street belong to the namespace http://myAddresses.org.
Note that the city element has no prefix. It does not inherit the namespace from its parent (addr) but assumes the default namespace, http://posample.org.
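
The scoping rules for the document in Figure 15.10 can also be checked outside DB2. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module, which is not part of DB2 and is used here only for illustration. It parses the same document and prints the expanded name of every element, written by ElementTree as {namespace-uri}local-name. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The document from Figure 15.10: a default namespace plus the prefix "add".
DOC = """
<customerinfo xmlns="http://posample.org"
              xmlns:add="http://myAddresses.org" Cid="1000">
  <name>Kathy Smith</name>
  <add:addr country="Canada">
    <add:street>5 Rosewood</add:street>
    <city>Toronto</city>
  </add:addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>
"""

root = ET.fromstring(DOC)

# ElementTree stores an expanded name as "{namespace-uri}local-name".
element_names = [elem.tag for elem in root.iter()]
for tag in element_names:
    print(tag)

# Unprefixed attributes are in no namespace, so their keys are bare names.
attribute_names = [key for elem in root.iter() for key in elem.attrib]
print(attribute_names)
```

The city element is reported in http://posample.org even though its parent addr is in http://myAddresses.org, and none of the attribute names carries a namespace URI.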

<customerinfo xmlns="http://posample.org"
              xmlns:add="http://myAddresses.org" Cid="1000">
  <name>Kathy Smith</name>
  <add:addr country="Canada">
    <add:street>5 Rosewood</add:street>
    <city>Toronto</city>
  </add:addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.10 Document with a default namespace and prefixed namespace

Figure 15.11 shows the tree representation of the document in Figure 15.10. The shading of the nodes indicates their namespace. The gray element nodes addr and street belong to the namespace http://myAddresses.org. The striped elements customerinfo, name, phone, and city belong to the default namespace. The attribute nodes (double bordered, white background) and text nodes (single bordered, white background) do not belong to any namespace. Note that a document tree never contains attribute nodes for the reserved namespace attributes (xmlns). We investigate this document further in section 15.2.

Figure 15.11 Tree representation of the document in Figure 15.10

Figure 15.12 and Figure 15.13 illustrate additional important characteristics of namespaces and element names. The three elements in Figure 15.12 might look different at first sight, but they have identical names. Remember that the full (expanded) name of an element consists of two parts, its local name and the namespace URI. For all three elements, the local name is addr, and the namespace name is http://myAddresses.org. The fact that the first two elements have different prefixes is not relevant. Both prefixes, x and y, are bound to the same namespace URI and therefore represent the same namespace. In other words, x and y are aliases for the same thing. The third addr element also belongs to the namespace http://myAddresses.org because this is the declared default namespace.
Prefixed namespaces and default namespaces are just two different mechanisms to declare that an element belongs to a certain namespace. The resulting namespace of an element is the same if the URIs are the same.

<x:addr xmlns:x="http://myAddresses.org"></x:addr>
<y:addr xmlns:y="http://myAddresses.org"></y:addr>
<addr xmlns="http://myAddresses.org"></addr>

Figure 15.12 Three XML elements with identical names

Figure 15.13 shows three elements that have different full names, although they all have the same local name (addr). The first two elements have identical namespace prefixes. This, however, is irrelevant because in the first element the prefix x is assigned to a different URI than in the second element. Hence, the namespace part of their full name is different. Note that a prefix is bound to exactly one URI within the scope of a namespace declaration. The first two elements can coexist in one document only because each carries its own declaration of x; where scopes overlap, the innermost declaration applies. The third element also has the local name addr, but the namespace of its full name is empty.

<x:addr xmlns:x="http://myAddresses.org"></x:addr>
<x:addr xmlns:x="http://yourAddresses.org"></x:addr>
<addr></addr>

Figure 15.13 Three distinct XML elements

Whether two elements have the same name or not is always decided based on their full (expanded) name and never based on their local name alone. Hence, an element addr without a namespace is different from an element addr that has a namespace, just as the elements addr and phone are different. Since XML queries use path expressions that contain element names, the proper usage of namespaces in queries is critical for obtaining correct query results.

15.2 EXPLORING NAMESPACES IN XML DOCUMENTS

If you are not certain about the namespaces in a particular document or about the namespace of a particular node, you can use the XQuery functions in Table 15.1 to obtain namespace information for specific elements or attributes. These functions are available in DB2 for Linux, UNIX, and Windows.

Table 15.1 Commonly Used Namespace and Node Functions

Function                    Description
--------------------------  ------------------------------------------------------------
name                        Returns the name of a node, such as an element or attribute
                            name. The returned name includes a namespace prefix, if
                            applicable.
local-name                  Returns the local name of a node, without a namespace prefix.
namespace-uri               Returns the namespace URI of a given node.
in-scope-prefixes           Returns a list of prefixes for all in-scope namespaces of an
                            element.
namespace-uri-for-prefix    Returns the namespace URI that is associated with a given
                            prefix of an in-scope namespace of an element.

The query in Figure 15.14 is an example of how you can use these functions. The query iterates over all elements in all documents in the XML column DOC in the table MYTABLE. For each element it returns a document with namespace information.

xquery
for $i in db2-fn:xmlcolumn("MYTABLE.DOC")//*
return
  <element>
    <local-name>{local-name($i)}</local-name>
    <name>{name($i)}</name>
    <namespace>{namespace-uri($i)}</namespace>
    <in-scope-namespaces>
      {for $j in in-scope-prefixes($i)
       return <namespace prefix="{$j}">
                {namespace-uri-for-prefix($j,$i)}
              </namespace> }
    </in-scope-namespaces>
  </element>;

Figure 15.14 Collecting namespace information for all elements

Let’s use this query to obtain information about some of the elements of the XML document shown in Figure 15.10. The namespace information for the elements customerinfo and street is shown in Figure 15.15. The name and local-name of the customerinfo element are identical because the element belongs to the default namespace http://posample.org, which has no prefix (that is, an empty prefix). The in-scope namespaces are the namespaces in whose scope the element is. Both namespaces http://posample.org and http://myAddresses.org are declared at the customerinfo element. Therefore, the customerinfo element is in the scope of both namespace declarations although it belongs to only one of them.
Note that the default namespace http://posample.org is associated with the empty prefix. We will come back to the empty prefix of the default namespace when we discuss updates of XML documents with namespaces. (The third in-scope namespace with the prefix xml is pre-declared in XQuery and always exists. You can ignore it for now.)

<element>
  <local-name>customerinfo</local-name>
  <name>customerinfo</name>
  <namespace>http://posample.org</namespace>
  <in-scope-namespaces>
    <namespace prefix="">http://posample.org</namespace>
    <namespace prefix="add">http://myAddresses.org</namespace>
    <namespace prefix="xml">http://www.w3.org/XML/1998/namespace
    </namespace>
  </in-scope-namespaces>
</element>
<element>
  <local-name>street</local-name>
  <name>add:street</name>
  <namespace>http://myAddresses.org</namespace>
  <in-scope-namespaces>
    <namespace prefix="">http://posample.org</namespace>
    <namespace prefix="add">http://myAddresses.org</namespace>
    <namespace prefix="xml">http://www.w3.org/XML/1998/namespace
    </namespace>
  </in-scope-namespaces>
</element>

Figure 15.15 Partial output from the query in Figure 15.14

The second XML element described in Figure 15.15 has the local name street, the name add:street, and the namespace http://myAddresses.org. Remember that the function name returns the local name plus the namespace prefix of an element. The element street is in the scope of the same namespaces as the element customerinfo and all other elements.

The query in Figure 15.16 shows another application of the XQuery functions in Table 15.1. For each element, attribute, and text node the query returns a sequence number, the node kind, the node name, the value of the node, and the namespace of the node. The output shown is produced from the document in Figure 15.10. The sequence number represents depth-first order of the nodes. Remember that the value of an element is defined as the concatenation of all descendant text nodes. Text nodes do not have a name or a namespace.

If namespaces are declared in XML documents, then they also need to be declared in the queries that run against them. If queries do not declare namespaces, then they might not find matching elements and might produce empty result sets.
The handling of namespaces in XML queries is the topic of the next section.
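
The distinction between a node's prefixed name, its local name, and its namespace URI can also be explored outside DB2. The sketch below is only an analogy to the local-name and namespace-uri functions in Table 15.1, written with Python's standard xml.etree.ElementTree module. The helper split_name is a hypothetical name introduced for this illustration. It is applied to the elements of Figure 15.12 and Figure 15.13.

```python
import xml.etree.ElementTree as ET

def split_name(tag):
    """Split ElementTree's "{uri}local" form into (namespace URI, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag  # element without a namespace

# Figure 15.12: different notations, identical expanded names.
SAME = [
    '<x:addr xmlns:x="http://myAddresses.org"></x:addr>',
    '<y:addr xmlns:y="http://myAddresses.org"></y:addr>',
    '<addr xmlns="http://myAddresses.org"></addr>',
]

# Figure 15.13: identical local names, three distinct expanded names.
DIFFERENT = [
    '<x:addr xmlns:x="http://myAddresses.org"></x:addr>',
    '<x:addr xmlns:x="http://yourAddresses.org"></x:addr>',
    '<addr></addr>',
]

same_names = {split_name(ET.fromstring(doc).tag) for doc in SAME}
different_names = {split_name(ET.fromstring(doc).tag) for doc in DIFFERENT}

print(same_names)            # a single (URI, local name) pair
print(len(different_names))  # three distinct pairs
```

The parser discards the prefixes x and y and keeps only the URI and the local name, which is the comparison rule that path expressions in queries follow as well.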

SELECT x.seq, x.kind, x.node AS nodename,
       SUBSTR(x.value,1,16) AS value, x.uri AS "Namespace URI"
FROM mytable,
     XMLTABLE('$DOC//(*, @*,text())'
       COLUMNS
         seq   FOR ORDINALITY,
         node  VARCHAR(35)  PATH 'name(.)',
         value VARCHAR(200) PATH 'substring(.,1,200)',
         kind  VARCHAR(4)   PATH 'if (self::attribute()) then "ATTR"
                                  else (if (self::text()) then "TEXT"
                                  else "ELEM")',
         uri   VARCHAR(50)  PATH 'namespace-uri(.)'
     ) AS x;

SEQ KIND NODENAME      VALUE            Namespace URI
--- ---- ------------- ---------------- -----------------------
  1 ELEM customerinfo  Kathy Smith5 Ros http://posample.org
  2 ATTR Cid           1000
  3 ELEM name          Kathy Smith      http://posample.org
  4 TEXT               Kathy Smith
  5 ELEM add:addr      5 RosewoodToront http://myAddresses.org
  6 ATTR country       Canada
  7 ELEM add:street    5 Rosewood       http://myAddresses.org
  8 TEXT               5 Rosewood
  9 ELEM city          Toronto          http://posample.org
 10 TEXT               Toronto
 11 ELEM phone         416-555-1358     http://posample.org
 12 ATTR type          work
 13 TEXT               416-555-1358

  13 record(s) selected.

Figure 15.16 Nodes and namespaces in the document in Figure 15.11

15.3 QUERYING XML DATA WITH NAMESPACES

Querying XML data always involves path expressions that navigate to specific elements or attributes in order to extract XML values or evaluate predicates. If the element and attribute names in a path expression do not match the full names of the elements and attributes in the XML documents, then the query returns an empty result set. Full names (expanded names) consist of namespace name and local name, and both must be specified in path expressions to match the element and attribute names in the documents. Therefore, any XPath or XQuery expression can have a prolog that consists of one or more namespace declarations. This also applies to XPath expressions that are embedded in an SQL/XML function, such as XMLQUERY, XMLTABLE, or XMLEXISTS. We first explain namespace declarations for XQuery expressions in general, and then show their usage in SQL/XML queries.

Many of the examples in the remainder of this chapter use customerinfo documents that contain one default namespace, such as the sample document in Figure 15.17. We assume that these documents reside in the XML column info of the table customer2.

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.17 Sample document with a namespace (table customer2)

15.3.1 Declaring Namespaces in XML Queries

Namespace declarations in a query can have one of the following two forms:

declare namespace cust="http://posample.org";
declare default element namespace "http://posample.org";

The first declaration declares a namespace with URI http://posample.org and prefix cust. The second declaration defines a default namespace. The keywords default element namespace emphasize the fact that a default namespace only applies to elements, not to attributes. All keywords must be in lowercase.

Each namespace declaration in a query must end with a semicolon (;). This syntax rule clashes with the default termination character for statements and commands in the DB2 Command Line Processor (CLP). To solve this conflict, invoke the DB2 CLP with the option -td# to use the # character as the termination character (db2 -td#). You can choose another character instead of # if you prefer.

Figure 15.18 shows an XQuery FLWOR expression with a default namespace declaration. This declaration binds all element names in the query to the namespace http://posample.org. It ensures that the element names used in the for, where, and return clauses have the same namespace as the element names in the documents in table customer2 (see Figure 15.17). The result set of the query contains name elements from two documents.
Each of these elements is in the namespace http://posample.org and therefore carries a namespace declaration. The namespaces in the query result set are always determined by the namespaces in the original documents that you query.

xquery
declare default element namespace "http://posample.org";
for $c in db2-fn:xmlcolumn("CUSTOMER2.INFO")/customerinfo
where $c/addr/city = "Markham" and $c/addr/@country = "Canada"
return $c/name #

<name xmlns="http://posample.org">Kathy Smith</name>
<name xmlns="http://posample.org">Jim Noodle</name>

  2 record(s) selected.

Figure 15.18 XQuery with a default namespace declaration

Remember that a namespace with prefixes can achieve the same binding of element names to a URI as a default namespace declaration. Hence, you can also write the query in Figure 15.18 with namespace prefixes instead of a default namespace and obtain exactly the same result set. This is shown in Figure 15.19. It does not matter whether the XML documents in the table use default namespace declarations, namespace prefixes, or a mix of both. What matters exclusively is that the same namespace URIs are declared in the query and the target document. Also, any namespace prefixes used in the query can be different from any prefixes in the XML documents, as long as the prefixes are associated with the same URI.

The attribute country in the where clause of Figure 15.19 does not have a namespace prefix and therefore matches the attribute’s empty namespace in the source document. If the attribute name in the query carried a prefix, the attribute would not be found and the where clause would evaluate to false. The style of the namespace declaration (default versus prefixes) in the result elements is determined by how the namespace is declared in the original document, not how the namespace is declared in the query.

xquery
declare namespace cust="http://posample.org";
for $c in db2-fn:xmlcolumn("CUSTOMER2.INFO")/cust:customerinfo
where $c/cust:addr/cust:city = "Markham" and $c/cust:addr/@country = "Canada"
return $c/cust:name #

<name xmlns="http://posample.org">Kathy Smith</name>
<name xmlns="http://posample.org">Jim Noodle</name>

  2 record(s) selected.

Figure 15.19 XQuery with namespace prefixes

The namespace declarations in Figure 15.18 and Figure 15.19 restrict the queries to one specific namespace. However, there can be situations in which you want to query across multiple namespaces. For example, you can store documents for different versions of an XML Schema in the same XML column. These documents can carry different namespaces although the majority of their structure and their local element names are identical.

You can use wildcards instead of namespace prefixes to match elements in any namespace and avoid namespace declarations. The query in Figure 15.20 is an example. This query works equally well for the documents in Figure 15.17 and Figure 15.10, which differ in their namespaces. The namespace wildcards also match the empty namespace. Thus, the query can even return name elements from customer documents without any namespaces. Note that the query also uses a namespace wildcard for the country attribute in the where clause. This wildcard enables the query to also retrieve data from customer documents that may assign namespaces to attributes.

xquery
for $c in db2-fn:xmlcolumn("CUSTOMER2.INFO")/*:customerinfo
where $c/*:addr/*:city = "Markham" and $c/*:addr/@*:country = "Canada"
return $c/*:name #

<name xmlns="http://posample.org">Kathy Smith</name>
<name xmlns="http://posample.org">Jim Noodle</name>

  2 record(s) selected.

Figure 15.20 XQuery with namespace wildcards

Let’s summarize the characteristics of namespace wildcards in queries. Namespace wildcards

• Match any namespace, including the empty namespace
• Enable you to write queries without namespace declarations
• Relieve you from knowing the exact namespace URIs used in the XML documents
• Enable you to query XML documents across multiple namespaces
• Do not restrict data access and query results to a particular namespace

The flexibility and ease of use of namespace wildcards are very compelling advantages in many application scenarios. However, one reason why you might not want to use namespace wildcards is that they don’t restrict data access to a particular namespace. For example, if you intentionally want to retrieve values from documents in one namespace but not others, namespace wildcards cannot be used.
Also, remember that the original purpose of namespaces is to avoid naming conflicts by pairing local names with URIs. However, namespace wildcards disregard the URIs and reduce the comparison of nodes to local names. This can be either desirable or undesirable, depending on the nature and requirements of a given application scenario.
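
The trade-off between namespace wildcards and declared namespaces can be tried out on a few small documents. The following sketch is an analogy only: it uses Python's standard xml.etree.ElementTree module (version 3.8 or later, where the {*} wildcard is supported) instead of DB2, and the third customer name is a made-up value for illustration.

```python
import xml.etree.ElementTree as ET

# Three structurally identical documents: two different namespaces and none.
# "Pat Example" is a hypothetical value that does not appear in the chapter.
DOCS = [
    '<customerinfo xmlns="http://posample.org">'
    '<name>Kathy Smith</name></customerinfo>',
    '<c:customerinfo xmlns:c="http://custinfo.org">'
    '<c:name>Jim Noodle</c:name></c:customerinfo>',
    '<customerinfo><name>Pat Example</name></customerinfo>',
]
roots = [ET.fromstring(doc) for doc in DOCS]

# "{*}name" matches name elements in any namespace, or in no namespace.
wildcard_hits = [root.findtext("{*}name") for root in roots]

# A declared prefix restricts the match to exactly one namespace URI.
ns = {"cust": "http://posample.org"}
restricted_hits = [root.findtext("cust:name", namespaces=ns) for root in roots]

print(wildcard_hits)
print(restricted_hits)
```

The wildcard path finds a name in all three documents, while the path with a declared namespace finds only the first one and yields None for the others. Which behavior is preferable depends on whether the application wants to separate the namespaces or treat them alike.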

15.3.2 Using Namespace Declarations in SQL/XML Queries

When you use XPath or XQuery expressions in the SQL/XML functions XMLQUERY and XMLTABLE, or in the XMLEXISTS predicate, these expressions can contain namespace declarations such as the ones discussed in the previous section. Figure 15.21 shows the same queries as in the previous section but in SQL/XML notation. The first statement uses default namespaces, the second uses namespace prefixes, and the third uses namespace wildcards.

If you use namespace declarations, each XMLQUERY and XMLTABLE function and each XMLEXISTS predicate must have its own declaration. There is no mechanism to declare namespaces just once for all SQL/XML functions in the query. In SQL/XML queries, namespaces cannot be declared at the SQL level outside of the SQL/XML functions.

Within the same query, different SQL/XML functions can declare namespaces in different ways. For example, you can choose to define namespace prefixes in the XMLQUERY function and to declare a default namespace or use namespace wildcards in the XMLEXISTS predicate of the same query.

SELECT XMLQUERY('
         declare default element namespace "http://posample.org";
         $INFO/customerinfo/name')
FROM customer2
WHERE XMLEXISTS('
         declare default element namespace "http://posample.org";
         $INFO/customerinfo/addr[city = "Markham" and
                                 @country = "Canada"]') #

SELECT XMLQUERY('
         declare namespace cust="http://posample.org";
         $INFO/cust:customerinfo/cust:name')
FROM customer2
WHERE XMLEXISTS('
         declare namespace cust="http://posample.org";
         $INFO/cust:customerinfo/cust:addr[cust:city = "Markham" and
                                           @country = "Canada"]') #

SELECT XMLQUERY('$INFO/*:customerinfo/*:name')
FROM customer2
WHERE XMLEXISTS('$INFO/*:customerinfo/*:addr[
                   *:city = "Markham" and @*:country = "Canada"]') #

Figure 15.21 Three different ways of handling namespaces in a query

NOTE: The SQL/XML statements in Figure 15.21 can also be run in DB2 for z/OS if you add the clause PASSING info AS "INFO" to each XMLEXISTS predicate and XMLQUERY function.

15.3.3 Using Namespaces in the XMLTABLE Function

The XMLTABLE function is different from the XMLEXISTS predicate and the XMLQUERY function because it contains multiple XQuery expressions and not just one. More specifically, it contains one row-generating expression and one or multiple column-generating expressions, as explained in section 7.3, Retrieving XML Values in Relational Format with XMLTABLE. For example, in Figure 15.22 the row-generating XPath expression is $INFO/*:customerinfo[*:addr/*:city="Markham"] and provides the context for the column-generating expressions @Cid, *:name, and *:addr/*:pcode-zip. In this example, most of these expressions use namespace wildcards because the source table customer2 contains documents with namespaces. No namespace wildcard is required for the Cid attribute because it does not belong to any namespace.

The result set produced by the XMLTABLE function does not contain namespace declarations because the result consists of non-XML data types that never contain namespaces.

SELECT x.id, x.name, x.zip
FROM customer2,
     XMLTABLE('$INFO/*:customerinfo[*:addr/*:city="Markham"]'
       COLUMNS
         id   INTEGER     PATH '@Cid',
         name VARCHAR(20) PATH '*:name',
         zip  VARCHAR(15) PATH '*:addr/*:pcode-zip'
     ) AS x;

ID          NAME                 ZIP
----------- -------------------- ---------------
       1001 Kathy Smith          N9C 3T6
       1002 Jim Noodle           N9C 3T6

  2 record(s) selected.

Figure 15.22 XMLTABLE function with namespace wildcards

Instead of namespace wildcards, each XQuery expression in the XMLTABLE function can have its own namespace declaration, as shown in Figure 15.23. These namespace declarations can differ from each other, for example, if there are multiple namespaces within each XML document. You don’t have to declare a namespace for the Cid attribute, which doesn’t belong to any namespace.

SELECT x.id, x.name, x.zip
FROM customer2,
     XMLTABLE('declare default element namespace "http://posample.org";
               $INFO/customerinfo[addr/city="Markham"]'
       COLUMNS
         id   INTEGER     PATH '@Cid',
         name VARCHAR(20) PATH 'declare namespace c="http://posample.org";
                                c:name',
         zip  VARCHAR(15) PATH 'declare namespace d="http://posample.org";
                                d:addr/d:pcode-zip'
     ) AS x #

Figure 15.23 XMLTABLE function with multiple namespace declarations

The query in Figure 15.23 repeats the namespace declaration three times even though there is only one namespace URI. Fortunately, this repetition of namespace declarations can be avoided with the SQL/XML function XMLNAMESPACES. It allows you to declare one or multiple namespaces inside an XMLTABLE function. These namespaces are global for all expressions in the XMLTABLE function. Therefore, the query in Figure 15.23 can be rewritten as shown in Figure 15.24 where a single default namespace is declared for all XQuery expressions in the XMLTABLE function. The query in Figure 15.24 returns the same result from the table customer2 as the queries in Figure 15.22 and Figure 15.23. There is no significant performance difference between these three queries.

SELECT x.id, x.name, x.zip
FROM customer2,
     XMLTABLE(XMLNAMESPACES(DEFAULT 'http://posample.org'),
              '$INFO/customerinfo[addr/city="Markham"]'
       COLUMNS
         id   INTEGER     PATH '@Cid',
         name VARCHAR(20) PATH 'name',
         zip  VARCHAR(15) PATH 'addr/pcode-zip'
     ) AS x;

Figure 15.24 Using XMLNAMESPACES to declare a default namespace

Be aware of the difference between the XMLNAMESPACES function and the declare namespace clauses that we used in previous queries. The clauses declare namespace and declare default element namespace are part of the XQuery language and appear as a prolog of an XQuery expression. In contrast, the XMLNAMESPACES function is part of the SQL language and defined in the SQL/XML standard.
It can appear only as an argument of the functions XMLTABLE, XMLELEMENT, and XMLFOREST.
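
The idea of declaring namespaces once and reusing them for several path expressions is not specific to SQL/XML. As a rough analogy outside DB2, the sketch below uses Python's standard xml.etree.ElementTree module (3.8 or later), where a single prefix map is passed to every lookup and an empty-string key plays the role of the default element namespace. The names NAMESPACES and row are illustrative, and the document is the one from Figure 15.17.

```python
import xml.etree.ElementTree as ET

# The sample document from Figure 15.17 (one default namespace).
DOC = """<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>"""

root = ET.fromstring(DOC)

# Declared once and shared by all paths below. The "" key is applied to
# unprefixed element names in a path, but never to attribute names.
NAMESPACES = {"": "http://posample.org"}

row = {
    "id": int(root.get("Cid")),  # Cid is in no namespace
    "name": root.findtext("name", namespaces=NAMESPACES),
    "zip": root.findtext("addr/pcode-zip", namespaces=NAMESPACES),
}
print(row)
```

As with XMLNAMESPACES in Figure 15.24, the paths themselves stay free of prefixes and the namespace URI appears in only one place.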

15.3.4 Dealing with Multiple Namespaces per Document

Writing XML queries becomes slightly more interesting if your XML documents contain multiple namespaces. As a sample document for this discussion we use the XML document that we previously discussed in Figure 15.10 and Figure 15.11. For convenience it is repeated here in Figure 15.25. We assume that this one document is stored in table customer3.

<customerinfo xmlns="http://posample.org"
              xmlns:add="http://myAddresses.org" Cid="1000">
  <name>Kathy Smith</name>
  <add:addr country="Canada">
    <add:street>5 Rosewood</add:street>
    <city>Toronto</city>
  </add:addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.25 Document with a default namespace and a prefixed namespace

When you query documents that contain multiple namespaces, declaring a single default namespace in your query is not sufficient. One possible solution is to use namespace wildcards as discussed in the previous sections. Since those wildcards match any namespace, the queries are the same for documents with one namespace, no namespace, or many namespaces. Hence, namespace wildcards are often a simple and resilient solution for namespace complexity.

If you want to query data in multiple specific namespaces instead of all namespaces, your queries need to contain multiple namespace declarations. The query in Figure 15.26 selects the name element from the document in Figure 15.25 and verifies that the city element has the value Toronto. A single namespace declaration is sufficient in the XMLQUERY function because all elements on the path /customerinfo/name belong to the same namespace. However, the XPath expression in the XMLEXISTS predicate contains elements from two different namespaces, both of which must be declared. You can either declare one default namespace and one namespace with a prefix, or two namespaces with prefixes. The latter approach is taken in the XMLEXISTS predicate in Figure 15.26.
Note that the namespace prefixes declared in the query do not have to match the prefixes in the XML document; only the namespace URIs have to match.

SELECT XMLQUERY('
         declare default element namespace "http://posample.org";
         $INFO/customerinfo/name')
FROM customer3
WHERE XMLEXISTS('
         declare namespace p="http://posample.org";
         declare namespace a="http://myAddresses.org";
         $INFO/p:customerinfo/a:addr[p:city = "Toronto"]') #

<name xmlns="http://posample.org"
      xmlns:add="http://myAddresses.org">Kathy Smith</name>

  1 record(s) selected.

Figure 15.26 SQL/XML statement with multiple namespace declarations

The name element that is returned by the query in Figure 15.26 has two namespace declarations. Those are all of the in-scope namespaces of the name element; that is, all namespaces that are declared in the original XML document at the name element or at any of its ancestors in the document tree. Although the name element itself belongs to only one of those namespaces (http://posample.org), it could potentially contain child elements that belong to other in-scope namespaces. This is illustrated in Figure 15.27 where an XQuery returns the addr element. The addr element contains the city element, which is in the default namespace of the original document. This namespace of the city element is maintained in the query result because the addr element contains declarations for all in-scope namespaces. Since namespaces are an integral component of XML element names, there is no mechanism to return XML elements without their in-scope namespace declarations.

xquery
declare namespace p="http://posample.org";
declare namespace a="http://myAddresses.org";
for $i in db2-fn:xmlcolumn("CUSTOMER3.INFO")/p:customerinfo
where $i/a:addr/p:city = "Toronto"
return $i/a:addr #

<add:addr xmlns="http://posample.org"
          xmlns:add="http://myAddresses.org" country="Canada">
  <add:street>5 Rosewood</add:street>
  <city>Toronto</city>
</add:addr>

  1 record(s) selected.

Figure 15.27 XQuery with multiple namespace declarations
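
The rule that prefixes are local to a query while URIs must agree can be checked with a small experiment outside DB2. The sketch below is an analogy written with Python's standard xml.etree.ElementTree module. It declares the prefixes p and a, as in Figure 15.26 and Figure 15.27, for a document that itself uses a default namespace and the prefix add. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The document from Figure 15.25.
DOC = """<customerinfo xmlns="http://posample.org"
              xmlns:add="http://myAddresses.org" Cid="1000">
  <name>Kathy Smith</name>
  <add:addr country="Canada">
    <add:street>5 Rosewood</add:street>
    <city>Toronto</city>
  </add:addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>"""

root = ET.fromstring(DOC)

# The prefixes "p" and "a" exist only in this lookup. The document never
# uses them; only the namespace URIs have to agree.
NS = {"p": "http://posample.org", "a": "http://myAddresses.org"}

name = root.findtext("p:name", namespaces=NS)
city = root.findtext("a:addr/p:city", namespaces=NS)

# Using a single namespace for the whole path misses the addr element,
# which belongs to http://myAddresses.org.
missed = root.findtext("p:addr/p:city", namespaces=NS)

print(name, city, missed)
```

The last lookup returns None, which corresponds to the empty result set that a query produces when it declares only one of the two namespaces on the path.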

A query result does not contain namespace declarations if you retrieve text nodes instead of elements, or if you use the XMLTABLE function to convert XML values to relational data types. The query in Figure 15.28 uses the XMLTABLE function with multiple namespaces declared in the XMLNAMESPACES function.

SELECT x.id, x.name, x.city
FROM customer3,
     XMLTABLE(XMLNAMESPACES('http://posample.org' as "p",
                            'http://myAddresses.org' as "a"),
              '$INFO/p:customerinfo[a:addr/p:city="Toronto"]'
       COLUMNS
         id   INTEGER     PATH '@Cid',
         name VARCHAR(20) PATH 'p:name',
         city VARCHAR(15) PATH 'a:addr/p:city'
     ) AS x ;

ID          NAME                 CITY
----------- -------------------- ---------------
       1000 Kathy Smith          Toronto

  1 record(s) selected.

Figure 15.28 Using the XMLNAMESPACES function with XMLTABLE

15.4 CREATING INDEXES FOR XML DATA WITH NAMESPACES

The discussion of querying XML data with namespaces has shown that proper declaration of namespaces in XML queries is critical to retrieving the desired data. Queries that do not declare namespaces correctly or do not use namespace wildcards typically return empty result sets. Equivalent concepts apply to XML indexes. If you define an XML index on XML documents that contain namespaces, the index definition needs to account for the namespaces. Otherwise the index definition might not match any XML nodes and will be empty. The remainder of this section assumes that you are already familiar with XML indexes and the conditions for index eligibility, as discussed in Chapter 13, Defining and Using XML Indexes.

As an example, let’s continue working with the table customer2, which contains XML documents with a single namespace (Figure 15.17). Suppose you frequently look up customers’ information based on their phone number. An index on the phone element can help speed up such searches. However, the index in Figure 15.29 is not suitable because it only builds index entries for phone elements that do not belong to any namespace.
Therefore, this index does not contain index entries for documents with a namespace such as the one in Figure 15.17. When a query searches for phone elements in a specific namespace, or if it uses wildcards to search in any namespace, the index in Figure 15.29 is not eligible.

CREATE INDEX idx1 ON customer2(info)
  GENERATE KEYS USING XMLPATTERN '/customerinfo/phone'
  AS SQL VARCHAR(30) #

Figure 15.29 Creating an index without namespace handling

Similar to XML queries, there are three ways to handle namespaces in XML index definitions:

• Declare and use a namespace prefix in the XMLPATTERN (Figure 15.30)
• Declare a default element namespace in the XMLPATTERN (Figure 15.31)
• Use namespace wildcards in the XMLPATTERN (Figure 15.32)

The syntax for namespace declarations in indexes is the same as for namespace declarations in queries. Figure 15.30 defines an index on phone elements in the namespace http://posample.org. The namespace prefix is irrelevant and does not have to match the prefixes used in the XML documents or queries. What matters is that the namespace URI of the index matches the namespace URI in the XML documents and your queries.

CREATE INDEX idx2 ON customer2(info)
  GENERATE KEYS USING XMLPATTERN
    'declare namespace ns="http://posample.org";
     /ns:customerinfo/ns:phone'
  AS SQL VARCHAR(30) #

Figure 15.30 Creating an index with namespace prefixes

The index idx3 in Figure 15.31 declares the same URI as a default namespace, and therefore does not use prefixes in the XML pattern /customerinfo/phone. The index definitions in Figure 15.30 and Figure 15.31 are equivalent. Both indexes contain entries for the same phone elements and can be used for the same queries. There is no preference for either one because they are just different notations for the same index.

CREATE INDEX idx3 ON customer2(info)
  GENERATE KEYS USING XMLPATTERN
    'declare default element namespace "http://posample.org";
     /customerinfo/phone'
  AS SQL VARCHAR(30) #

Figure 15.31 Creating an index with a default element namespace

The index definition in Figure 15.32 uses namespace wildcards to match customer phone elements in any namespace.
This index is different from the previous two because it can contain index entries for phone elements from multiple different namespaces, including the empty namespace. If the info column contains documents that are structurally the same but have different namespaces, then this index contains information for all of them.

CREATE INDEX idx4 ON customer2(info)
  GENERATE KEYS USING XMLPATTERN '/*:customerinfo/*:phone'
  AS SQL VARCHAR(30) #

Figure 15.32 Creating an index with namespace wildcards

The set of queries that an index can be used for, that is, the index eligibility, depends on how you handle namespaces in queries and index definitions. Figure 15.33 shows four queries that might be able to use some of the four XML indexes that we have discussed. The first query defines and uses a namespace prefix, the second query uses a default namespace, and the third query contains namespace wildcards. The fourth query looks for phone elements that do not have a namespace, which is the same as having an empty namespace.

--Query 1 (namespace with prefix):
SELECT info FROM customer2
WHERE XMLEXISTS('declare namespace n="http://posample.org";
                 $INFO/n:customerinfo[n:phone = "416-555-1358"]')#

--Query 2 (default namespace):
SELECT info FROM customer2
WHERE XMLEXISTS('
         declare default element namespace "http://posample.org";
         $INFO/customerinfo[phone = "416-555-1358"]')#

--Query 3 (namespace wildcards):
SELECT info FROM customer2
WHERE XMLEXISTS('$INFO/*:customerinfo[*:phone="416-555-1358"]')#

--Query 4 (no namespace):
SELECT info FROM customer2
WHERE XMLEXISTS('$INFO/customerinfo[phone = "416-555-1358"]')#

Figure 15.33 Four ways of handling namespaces in query predicates

Table 15.2 summarizes which queries can use which of the indexes. The four queries in Figure 15.33 are represented by four rows in the table. The four XML indexes in Figure 15.29 through Figure 15.32 correspond to the four columns in the table. The entries marked Y in the table indicate that a certain index is eligible to evaluate a certain query.
15.4 Creating Indexes for XML Data with Namespaces Table 15.2 459 Index Eligibility with Namespaces in XML Indexes and Predicates Index Definition Query Query 1 (namespace prefixes) Query 2 (default namespace) Query 3 (namespace wildcard) Query 4 (no namespace) idx1 (no namespace) N N N Y idx2 (namespace prefix) Y Y N N idx3 (default namespace) Y Y N N idx4 (namespace wildcards) Y Y Y Y The index idx1 cannot be used for query 1, 2, and 3. The reason is that these queries look for phone elements that have a namespace, but index idx1 contains entries for phone elements that do not have a namespace. The rows for query 1 and query 2 have identical entries, and the columns for indexes idx2 and idx3 also have identical entries. This is because declaring namespace prefixes and declaring a default namespace are equivalent. They are just two different notations for the same thing and you can use either one without affecting index matching. The indexes idx2 and idx3 can be used for query 1 and 2 but not for query 3, which uses namespace wildcards, or query 4, which uses no namespaces. The reason is that these indexes contain entries for the one specific namespace that query 1 and 2 are searching for. Indexes idx2 and idx3 do not contain index entries for phone elements in any namespace or no namespace. Index idx4 with namespace wildcards is eligible for all four queries. Since it contains index entries for any namespace, it certainly includes index entries for the specific namespace that query 1 and 2 are searching for. Remember that namespace wildcards also match missing or empty namespaces. Therefore, index idx4 can be used to evaluate query 4. XML attributes often do not belong to a namespace and they never belong to a default namespace. The Cid attribute in the document in Figure 15.17 is an example. To index such an attribute, ensure that you account for the namespace of the element that the attribute belongs to. Three options are shown in Figure 15.34. 
A fourth option is to use the XMLPATTERN /*:customerinfo/@*:Cid, which even matches Cid attributes that are in any or no namespace.
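The rule that a default namespace applies to elements but never to attributes can be checked with any namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports names in the form {uri}local; the abbreviated sample document is an assumption modeled on the customerinfo documents in this chapter.

```python
import xml.etree.ElementTree as ET

doc = ('<customerinfo xmlns="http://posample.org" Cid="1000">'
       '<name>Kathy Smith</name></customerinfo>')
root = ET.fromstring(doc)

# The element picks up the default namespace ...
element_name = root.tag
# ... but the attribute name stays unqualified.
attribute_names = list(root.attrib)

print(element_name)     # {http://posample.org}customerinfo
print(attribute_names)  # ['Cid']
```

This is why the index patterns qualify customerinfo with a namespace but refer to the attribute simply as @Cid.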
CREATE INDEX idxcid ON customer2(info)
  GENERATE KEYS USING XMLPATTERN
  'declare namespace n="http://posample.org"; /n:customerinfo/@Cid'
  AS SQL DOUBLE #

CREATE INDEX idxcid ON customer2(info)
  GENERATE KEYS USING XMLPATTERN
  'declare default element namespace "http://posample.org"; /customerinfo/@Cid'
  AS SQL DOUBLE #

CREATE INDEX idxcid ON customer2(info)
  GENERATE KEYS USING XMLPATTERN '/*:customerinfo/@Cid'
  AS SQL DOUBLE #

Figure 15.34 Creating an index on an attribute

15.5 CONSTRUCTING XML DATA WITH NAMESPACES

In Chapter 10, Producing XML from Relational Data, you learned how to construct XML documents from existing relational data. The generated XML documents do not contain any namespaces unless you explicitly construct namespace declarations and prefixes as needed. In this section we first explain how to construct namespaces when you use the SQL/XML publishing functions in DB2 for z/OS or DB2 for Linux, UNIX, and Windows. Then we describe how to create namespaces when you use direct element and attribute constructors in XQuery in DB2 for Linux, UNIX, and Windows.

15.5.1 SQL/XML Publishing Functions and Namespaces

The SQL/XML function XMLNAMESPACES can be an argument to the XMLELEMENT and XMLFOREST functions and allows you to construct one or multiple namespace declarations for the constructed documents. These namespace declarations are visible in any nested XML construction so that you do not need to construct the same namespace in every nested publishing function. Note that the XMLNAMESPACES function itself does not declare a namespace. It constructs namespace declarations, which in turn declare namespaces in the generated documents.

The query in Figure 15.35 is an example that is slightly extended from the examples shown in section 10.1.1, Constructing XML Elements from Relational Data. It contains the XMLNAMESPACES function to construct a default namespace declaration for the entire generated document.
The XMLNAMESPACES function is an argument of the XMLELEMENT function that constructs the root element PRODUCT. Therefore, the default namespace applies to all XML elements in the constructed document. If present, the XMLNAMESPACES function has to be the second argument of the XMLELEMENT function; that is, it appears after the element name but before any XMLATTRIBUTES function.
SELECT XMLELEMENT(NAME "PRODUCT",
         XMLNAMESPACES(DEFAULT 'http://myproduct.net'),
         XMLATTRIBUTES(pid),
         XMLELEMENT(NAME "PRICE", price),
         XMLELEMENT(NAME "PROMOTION",
           XMLATTRIBUTES(promoprice),
           XMLFOREST(promostart, promoend)
         )
       )
FROM product
WHERE pid = '100-100-01';

<PRODUCT xmlns="http://myproduct.net" PID="100-100-01">
  <PRICE>9.99</PRICE>
  <PROMOTION PROMOPRICE="7.25">
    <PROMOSTART>2004-11-19</PROMOSTART>
    <PROMOEND>2004-12-19</PROMOEND>
  </PROMOTION>
</PRODUCT>

  1 record(s) selected.

Figure 15.35 Constructing default namespace declaration

For another example, suppose that the constructed PROMOTION element and all its child elements have to be in the separate namespace http://mypromo.net. This can be achieved in several ways. One approach is to use a second XMLNAMESPACES function as an argument to the XMLELEMENT function that constructs the PROMOTION element. This XMLNAMESPACES function constructs a new default namespace that overrides the top-level default namespace (see Figure 15.36).

SELECT XMLELEMENT(NAME "PRODUCT",
         XMLNAMESPACES(DEFAULT 'http://myproduct.net'),
         XMLATTRIBUTES(pid),
         XMLELEMENT(NAME "PRICE", price),
         XMLELEMENT(NAME "PROMOTION",
           XMLNAMESPACES(DEFAULT 'http://mypromo.net'),
           XMLATTRIBUTES(promoprice),
           XMLFOREST(promostart, promoend)
         )
       )
FROM product
WHERE pid = '100-100-01';

<PRODUCT xmlns="http://myproduct.net" PID="100-100-01">
  <PRICE>9.99</PRICE>
  <PROMOTION xmlns="http://mypromo.net" PROMOPRICE="7.25">
    <PROMOSTART>2004-11-19</PROMOSTART>
    <PROMOEND>2004-12-19</PROMOEND>
  </PROMOTION>
</PRODUCT>

Figure 15.36 Constructing multiple default namespace declarations

Another option for producing a different namespace for part of the document is to construct multiple namespace declarations in a single XMLNAMESPACES function at the top of the document. The query in Figure 15.37 constructs a default namespace declaration for the URI http://myproduct.net as well as a namespace prefix declaration for the URI http://mypromo.net with the prefix promo. In the nested XMLELEMENT and XMLFOREST functions the prefix promo is explicitly added to the generated element names to assign certain elements to that namespace. All other elements belong to the default namespace.

SELECT XMLELEMENT(NAME "PRODUCT",
         XMLNAMESPACES(DEFAULT 'http://myproduct.net',
                       'http://mypromo.net' AS "promo"),
         XMLATTRIBUTES(pid),
         XMLELEMENT(NAME "PRICE", price),
         XMLELEMENT(NAME "promo:PROMOTION",
           XMLATTRIBUTES(promoprice),
           XMLFOREST(promostart AS "promo:PROMOSTART",
                     promoend AS "promo:PROMOEND")
         )
       )
FROM product
WHERE pid = '100-100-01';

<PRODUCT xmlns="http://myproduct.net"
         xmlns:promo="http://mypromo.net" PID="100-100-01">
  <PRICE>9.99</PRICE>
  <promo:PROMOTION PROMOPRICE="7.25">
    <promo:PROMOSTART>2004-11-19</promo:PROMOSTART>
    <promo:PROMOEND>2004-12-19</promo:PROMOEND>
  </promo:PROMOTION>
</PRODUCT>

Figure 15.37 Constructing multiple namespace declarations

15.5.2 XQuery Constructors and Namespaces

The direct element and attribute constructors of the XQuery language were introduced in section 8.4, Constructing XML Data, and further elaborated on in section 10.2, Using XQuery Constructors with Relational Input. Direct element and attribute constructors allow you to simply type the tags of the XML documents the way you want them to be constructed and nested. In the same manner you can type namespace declarations into the start tags of element constructors the way you want them to appear in the document.
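One way to confirm which namespace each element of the Figure 15.36 result ends up in is to parse the generated document with a namespace-aware parser. The following sketch uses Python's standard xml.etree.ElementTree for that purpose; it is an illustration outside DB2, and the helper name split_name is an assumption.

```python
import xml.etree.ElementTree as ET

# Result document of the query in Figure 15.36.
doc = """<PRODUCT xmlns="http://myproduct.net" PID="100-100-01">
  <PRICE>9.99</PRICE>
  <PROMOTION xmlns="http://mypromo.net" PROMOPRICE="7.25">
    <PROMOSTART>2004-11-19</PROMOSTART>
    <PROMOEND>2004-12-19</PROMOEND>
  </PROMOTION>
</PRODUCT>"""

def split_name(tag):
    """Split ElementTree's '{uri}local' notation into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

# Map each element's local name to the namespace URI it belongs to.
names = {}
for element in ET.fromstring(doc).iter():
    uri, local = split_name(element.tag)
    names[local] = uri

for local, uri in names.items():
    print(f"{local:<11}{uri}")
```

The nested default namespace declaration takes effect for PROMOTION and its children only, while PRODUCT and PRICE stay in the top-level default namespace.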
This is shown by the three queries in Figure 15.38, which construct the same XML documents as the queries in Figure 15.35, Figure 15.36, and Figure 15.37 in the previous section, respectively. Remember that element and attribute values can be obtained from XML or relational columns and are specified by expressions in curly brackets. A simple and common type of expression is a relational column name used as an uppercase variable that starts with a $ sign, such as $PID.
SELECT XMLQUERY('
  <PRODUCT xmlns="http://myproduct.net" PID="{$PID}">
    <PRICE>{$PRICE}</PRICE>
    <PROMOTION PROMOPRICE="{$PROMOPRICE}">
      <PROMOSTART>{$PROMOSTART}</PROMOSTART>
      <PROMOEND>{$PROMOEND}</PROMOEND>
    </PROMOTION>
  </PRODUCT>')
FROM product
WHERE pid = '100-100-01';

SELECT XMLQUERY('
  <PRODUCT xmlns="http://myproduct.net" PID="{$PID}">
    <PRICE>{$PRICE}</PRICE>
    <PROMOTION xmlns="http://mypromo.net" PROMOPRICE="{$PROMOPRICE}">
      <PROMOSTART>{$PROMOSTART}</PROMOSTART>
      <PROMOEND>{$PROMOEND}</PROMOEND>
    </PROMOTION>
  </PRODUCT>')
FROM product
WHERE pid = '100-100-01';

SELECT XMLQUERY('
  <PRODUCT xmlns="http://myproduct.net"
           xmlns:promo="http://mypromo.net" PID="{$PID}">
    <PRICE>{$PRICE}</PRICE>
    <promo:PROMOTION PROMOPRICE="{$PROMOPRICE}">
      <promo:PROMOSTART>{$PROMOSTART}</promo:PROMOSTART>
      <promo:PROMOEND>{$PROMOEND}</promo:PROMOEND>
    </promo:PROMOTION>
  </PRODUCT>')
FROM product
WHERE pid = '100-100-01';

Figure 15.38 Three ways of constructing namespaces in XQuery

15.6 UPDATING XML DATA WITH NAMESPACES

In this section we discuss the handling of namespaces when you update XML documents with XQuery update expressions in DB2 for Linux, UNIX, and Windows. This section assumes that you are familiar with XML updates, which were discussed in Chapter 12, Updating and Transforming XML Documents.

When you update XML data that contains namespaces, you must specify namespace declarations in the XQuery Update expressions. Otherwise the elements or attributes that you want to update are not found, which typically causes UPDATE statements to fail. As we explained in Chapter 12, replacing a node, replacing the value of a node, renaming a node, and inserting a node fail if the
target path in the XQuery Update expression does not exist in the document that you try to update. Only the delete operation behaves differently. If you try to delete an element or attribute that does not exist in the document, the UPDATE statement succeeds and the document remains unchanged. Let’s look at some examples based on the sample document in Figure 15.17.

15.6.1 Updating Values in Documents with Namespaces

The document in Figure 15.17 contains a default namespace, which applies to all elements. The UPDATE statement in Figure 15.39 does not declare an appropriate namespace. Hence, the target path $new/customerinfo/phone[@type = "work"] produces an empty sequence and the statement fails.

UPDATE customer2
SET info = XMLQUERY('
  copy $new := $INFO
  modify do replace value of
         $new/customerinfo/phone[@type = "work"]
         with "123-456-7890"
  return $new')
WHERE cid = 1000 #

SQL16085N The target node of an XQuery "replace value of" expression is not valid. Error QName=err:XUTY0008.

Figure 15.39 Update expression without namespace declaration

There are three ways to avoid this error, as shown in Figure 15.40. The first UPDATE statement in Figure 15.40 declares an appropriate default element namespace that applies to all elements in the update expression. The second statement defines and uses the namespace prefix po to achieve the same effect. Note that the prefix is not used for the attribute type, which does not belong to any namespace in the document that is being updated. The third statement in Figure 15.40 uses namespace wildcards to match elements in any namespace, including the empty namespace.

UPDATE customer2
SET info = XMLQUERY('
  declare default element namespace "http://posample.org";
  copy $new := $INFO
  modify do replace value of
         $new/customerinfo/phone[@type = "work"]
         with "123-456-7890"
  return $new')
WHERE cid = 1000 #

Figure 15.40 Update expression with proper namespace handling

UPDATE customer2
SET info = XMLQUERY('
  declare namespace po="http://posample.org";
  copy $new := $INFO
  modify do replace value of
         $new/po:customerinfo/po:phone[@type = "work"]
         with "123-456-7890"
  return $new')
WHERE cid = 1000 #

UPDATE customer2
SET info = XMLQUERY('
  copy $new := $INFO
  modify do replace value of
         $new/*:customerinfo/*:phone[@type = "work"]
         with "123-456-7890"
  return $new')
WHERE cid = 1000 #

Figure 15.40 Update expression with proper namespace handling (Continued)

15.6.2 Renaming Nodes in Documents with Namespace Prefixes

Additional attention is required when you rename elements or attributes that belong to a namespace. Renaming elements in a document with a default namespace behaves differently from renaming elements in a document with namespace prefixes.

First consider a document with namespace prefixes, such as the original document in Figure 15.41, and suppose you want to rename the element addr as address. The UPDATE statement declares the correct namespace and uses the namespace prefix po in the target path. However, the new name “address” in the rename expression is a local name without a namespace. As a result, the original element with the local name addr and the namespace URI http://posample.org is renamed to the local name address and an empty namespace.

If the intention is to change the local name of an element but not to change its namespace, then the new name provided in the rename expression must be a full name consisting of the correct namespace and the new local name. For example, the UPDATE statement in Figure 15.42 adds the declared namespace prefix po to the new element name. Due to this prefix, the new name “po:address” contains the same namespace as the original element addr and all other elements in the document. The updated document now has two different prefixes for the same namespace URI.
This is not a problem because what matters are the identical URIs, irrespective of the prefixes, and that all elements belong to the same namespace as intended.
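That prefixes are only local shorthand for the namespace URI can be seen by parsing a document that mixes two prefixes for one URI. The sketch below uses Python's standard xml.etree.ElementTree on an abbreviated form of the updated document; the query prefix x is an arbitrary, hypothetical choice that appears in neither the document nor the UPDATE statement.

```python
import xml.etree.ElementTree as ET

# Abbreviated updated document: prefixes c and po are bound to the same URI.
doc = ('<c:customerinfo xmlns:c="http://posample.org" Cid="1000">'
       '<c:name>Kathy Smith</c:name>'
       '<po:address xmlns:po="http://posample.org" country="Canada">'
       '<c:street>5 Rosewood</c:street>'
       '</po:address>'
       '</c:customerinfo>')
root = ET.fromstring(doc)

# Collect the namespace URI of every element; the prefixes are not part of it.
namespaces = {element.tag[1:].split("}", 1)[0] for element in root.iter()}
print(namespaces)

# A lookup can bind any prefix it likes, as long as the URI is the same.
address = root.find("x:address", {"x": "http://posample.org"})
print(address.get("country"))
```

All four elements report the single URI http://posample.org, and the lookup with the unrelated prefix x still finds the renamed element.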
UPDATE customer2
SET info = XMLQUERY(
  'declare namespace po="http://posample.org";
   copy $new := $INFO
   modify do rename $new/po:customerinfo/po:addr as "address"
   return $new')
WHERE cid = 1000 #

Original document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <c:addr country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </c:addr>
</c:customerinfo>

Updated document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <address country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </address>
</c:customerinfo>

Figure 15.41 Renaming an element in a document with namespace prefixes

UPDATE customer2
SET info = XMLQUERY(
  'declare namespace po="http://posample.org";
   copy $new := $INFO
   modify do rename $new/po:customerinfo/po:addr as "po:address"
   return $new')
WHERE cid = 1000 #

Original document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <c:addr country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </c:addr>
</c:customerinfo>

Updated document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <po:address xmlns:po="http://posample.org" country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </po:address>
</c:customerinfo>

Figure 15.42 Using a namespace prefix in the new element name

Introducing the second namespace prefix in the updated document can be avoided if the UPDATE statement declares the same namespace prefix as the original document (see Figure 15.43).
Another solution is to declare the correct default namespace in the UPDATE statement.
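The idea that a full element name is a pair of namespace and local name, and that a rename replaces the whole pair, can be mimicked in a small sketch. The code below is only an analogy in Python's standard xml.etree.ElementTree, not XQuery Update: assigning a new tag plays the role of the rename expression, and the helper name rename_addr is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://posample.org"
DOC = ('<c:customerinfo xmlns:c="http://posample.org" Cid="1000">'
       '<c:addr country="Canada"><c:city>Toronto</c:city></c:addr>'
       '</c:customerinfo>')

def rename_addr(new_tag):
    """Parse a fresh copy of DOC and give the addr element a new full name."""
    root = ET.fromstring(DOC)
    root.find(f"{{{NS}}}addr").tag = new_tag
    return root

# Analogous to: rename ... as "address" (local name only, empty namespace)
local_only = rename_addr("address")
# Analogous to: rename ... as "po:address" (namespace URI plus local name)
qualified = rename_addr(f"{{{NS}}}address")

print([child.tag for child in local_only])
print([child.tag for child in qualified])
```

In the first result the renamed element has left the namespace of its parent and children; in the second only the local name has changed.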
UPDATE customer2
SET info = XMLQUERY(
  'declare namespace c="http://posample.org";
   copy $new := $INFO
   modify do rename $new/c:customerinfo/c:addr as "c:address"
   return $new')
WHERE cid = 1000 #

Original document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <c:addr country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </c:addr>
</c:customerinfo>

Updated document:

<c:customerinfo xmlns:c="http://posample.org" Cid="1000">
  <c:name>Kathy Smith</c:name>
  <c:address country="Canada">
    <c:street>5 Rosewood</c:street>
    <c:city>Toronto</c:city>
    <c:prov-state>Ontario</c:prov-state>
    <c:pcode-zip>M6W 1E6</c:pcode-zip>
  </c:address>
</c:customerinfo>

Figure 15.43 Same namespace prefix in the update statement and document

15.6.3 Renaming Nodes in Documents with Default Namespaces

Now let’s look at renaming elements in a document with a default namespace, such as the one in Figure 15.44. Remember that a default namespace has no prefix, which is the same as an empty prefix.

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

Figure 15.44 Sample document with a default namespace

Let’s again rename the element addr as address. The UPDATE statement in Figure 15.45 declares the correct namespace and uses the namespace prefix po in the target path. The new name “address” in the rename expression is a local name without a namespace and without a prefix; that is, both namespace and prefix of the new name are empty.
This causes a conflict because in the target document the empty prefix is already associated with the default namespace http://posample.org and cannot also be associated with the empty namespace at the same time. Therefore the update fails and error SQL16088N is returned.
UPDATE customer2
SET info = XMLQUERY(
  'declare namespace po="http://posample.org";
   copy $new := $INFO
   modify do rename $new/po:customerinfo/po:addr as "address"
   return $new')
WHERE cid = 1000 #

SQL16088N A "rename" expression has a binding of a namespace prefix "" to namespace URI "", introduced to an element named "addr", that conflicts with an existing namespace binding of the same prefix to a different URI in the in-scope namespaces of that element node. Error QName=err:XUDY0023. SQLSTATE=10708

Figure 15.45 Renaming an element can cause namespace conflicts.

To avoid the error in Figure 15.45, the new element name in the rename expression must be in the same namespace as the default namespace of the target document. There are two ways to achieve this:

• Add the declared namespace prefix po to the new element name, so that it matches the namespace in the document:

    rename $new/po:customerinfo/po:addr as "po:address"

  This rename expression clearly ensures that po:addr and po:address are in the same namespace and only the local name of the element is changed, without interfering with its namespace.

• In the UPDATE statement, declare the namespace http://posample.org as a default namespace instead of a namespace with prefix. The new name “address” then assumes this default namespace and matches the namespace in the target document.

15.6.4 Inserting and Replacing Nodes in Documents with Namespaces

When you insert elements or attributes into a document, or if you replace existing nodes with new nodes, similar namespace considerations apply as for renaming elements. For example, if you want to insert the new element <email>kathy@ibm.com</email> into a document, be aware of the namespace for this new element. The UPDATE statement in Figure 15.46 declares a namespace with a prefix but does not use the prefix for the new element.
Therefore the new email element does not belong to any namespace, which is equivalent to the empty namespace. In the updated document the element therefore contains the empty namespace declaration xmlns="". This undeclares the default namespace of the document and ensures that the email element does not belong to the default namespace. Chances are that this is not what you wanted.
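The effect of the empty declaration xmlns="" can be verified by parsing a document that contains it. The following sketch uses Python's standard xml.etree.ElementTree on an abbreviated form of such an updated document; the abbreviation (only the name and email children) is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# Abbreviated updated document: email carries the empty namespace declaration.
doc = ('<customerinfo xmlns="http://posample.org" Cid="1000">'
       '<name>Kathy Smith</name>'
       '<email xmlns="">kathy@ibm.com</email>'
       '</customerinfo>')
root = ET.fromstring(doc)

# ElementTree reports namespaced names as '{uri}local' and others as 'local'.
child_names = [child.tag for child in root]
print(child_names)  # ['{http://posample.org}name', 'email']

# A lookup in the document's namespace therefore does not find the new element.
found = root.find("{http://posample.org}email")
print(found)        # None
```

Any query or index that expects email in the namespace http://posample.org would miss this element, which is usually a sign that the insert was not written as intended.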
UPDATE customer2
SET info = XMLQUERY(
  'declare namespace po="http://posample.org";
   copy $new := $INFO
   modify do insert <email>kathy@ibm.com</email>
          as last into $new/po:customerinfo
   return $new')
WHERE cid = 1000 #

Original document:

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
</customerinfo>

Updated document:

<customerinfo xmlns="http://posample.org" Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <email xmlns="">kathy@ibm.com</email>
</customerinfo>

Figure 15.46 Inserting an element without a namespace

In most cases you probably want the new email element to belong to the default namespace of the target document. You achieve this in one of two ways:

• Add the declared namespace prefix po to the new element name, so that it matches the namespace in the document:

    do insert <po:email>kathy@ibm.com</po:email>

• In the UPDATE statement, declare the namespace http://posample.org as a default namespace instead of a namespace with prefix. The new email element then assumes this default namespace and matches the namespace in the target document.

Other insert, replace, and rename scenarios with namespaces are variations of the cases that we have discussed. We encourage you to work with the DB2 sample database hands-on and to test out these and other update scenarios. XML namespaces are often perceived as a difficult area in the XML world, and some hands-on experiments are the best way to become comfortable with them.

15.7 SUMMARY

XML namespaces are a standard that allows the designer of XML documents or XML Schemas to define unique element and attribute names and to group them together into a well-defined vocabulary of XML tags.
Namespaces greatly help to avoid conflicts between tag names in XML documents that come from multiple sources. The full name of an XML element or attribute always consists of a namespace and a local name. If an XML element does not belong to a namespace then the namespace part of its name is empty.
When you use XPath expressions to query, index, or update XML documents that contain namespaces, then these XPath expressions must contain namespace declarations. Otherwise they do not identify any nodes in the XML documents. You can declare a namespace with a prefix and use that prefix for every element name that belongs to that namespace. Alternatively, you can declare a default namespace that applies to all elements in the XPath expression without the use of prefixes. Remember that a default namespace never applies to attributes.

You can also use namespace wildcards (*:) to match elements regardless of their namespace. Namespace wildcards can be very convenient to avoid namespace declarations altogether. However, if you have a mix of documents from multiple namespaces and you want to query only documents in one specific namespace, then you must declare that specific namespace and cannot use wildcards.

Namespaces are closely related to XML Schemas. An XML Schema can define a target namespace to declare that all elements (and optionally also all attributes) defined in the schema belong to this specific namespace. This and other aspects of XML Schemas are explained in the next chapter.

CHAPTER 16 Managing XML Schemas

XML Schemas are commonly used to define what XML documents are allowed to look like in terms of their structure, element and attribute names, data types, and other document characteristics. Due to their rich capabilities for defining XML document constraints, XML Schemas are the preferred instrument for enforcing XML data quality. The XML Schema language is a widely adopted standard and supported in many tools and middleware software products. The use of XML Schemas in DB2 is optional and we will explain when and how to use them.

Our discussion of XML Schemas is split into two chapters. This chapter introduces XML Schemas and focuses on registering and managing them in DB2. Chapter 17, Validating XML Documents against XML Schemas, then elaborates further on the use of XML Schemas to validate XML documents in operations such as insert, load, or update.

This chapter is organized along the following topics:

• Introduction to XML Schemas and considerations for their usage (section 16.1)
• A detailed look at two XML Schemas, a simple one (section 16.2) and a more complex one that consists of multiple schema documents (section 16.3)
• Registering and removing XML Schemas in DB2’s XML Schema Repository (sections 16.4 and 16.5)
• XML Schema evolution (section 16.6)
• Usage privileges for XML Schemas in DB2 (section 16.7)
• Document Type Definitions (DTDs) and external entities (section 16.8)
• Understanding and querying the tables in the XML Schema Repository (section 16.9)
• Additional considerations for managing XML Schemas in DB2 for z/OS (section 16.10)

Although we provide an introduction to XML Schemas in this chapter, a complete coverage of all aspects and facets of XML Schemas is beyond the scope of this book. Instead we focus on DB2’s capabilities for handling XML Schemas. References to detailed resources about XML Schemas in general are provided in Appendix C, Further Reading.

16.1 INTRODUCTION TO XML SCHEMAS AND THEIR USAGE

Roughly speaking, an XML Schema is a specific type of XML document that defines the characteristics and structure of other XML documents. An XML Schema can be used to define some or all of the following:

• The allowed element and attribute names as well as the structure in which these elements and attributes can be nested.
• The namespace(s) that all the defined elements and/or attributes belong to.
• Mandatory or optional occurrences of elements and attributes. For example, you can define that each customerinfo document has to have a name element, and that phone elements are optional.
• The minimum and/or maximum number of occurrences of an element in a document. For example, you can define that a customer cannot have more than one name, but that multiple phone numbers are allowed.
• The allowed data types for some or all of the elements and attributes in a document. For example, you can define the customer name to be a string of at most 30 characters, and the customer ID to be a positive integer.
• The allowed pattern for certain values. For example, you can define that the value of the phone element has to be of the form xxx-xxx-xxxx.
• Derived or complex data types for your elements and attributes, as well as default values. An example of a derived data type is an integer type restricted to the range of values from 1 to 100.
• The exclusive choice between two or more elements.
For example, you can define that a customer can have a secretary element or an assistant element, but not both.
• That certain branches of the document can contain any elements, even if they are not defined in the XML Schema. This flexibility allows documents to be extensible and still compliant with the XML Schema.

The preceding list is not exhaustive, as there are many more aspects of a document that can be defined or constrained in an XML Schema.

Three very important concepts of XML Schemas are:
• The degree to which an XML Schema defines the characteristics of XML documents can be very loose to allow for a lot of flexibility, or very strict to tightly control the XML data in every aspect, or anything in between.
• The use of XML Schemas is optional. You can use XML Schemas but you don’t have to. There is no penalty in terms of DB2 performance or functionality if you don’t use an XML Schema.
• XML Schemas define constraints that are applied to one XML document at a time. Today there is no standardized notation or method that defines constraints across multiple XML documents.

16.1.1 Valid Versus Well-Formed XML Documents

If a document complies with a given XML Schema, then this document is valid with respect to that particular schema. If you have two different XML Schemas, a given document might be valid with respect to one schema but not the other. The process of determining whether an XML document is valid for a given XML Schema is called validation or schema validation.

Validation is an optional part of XML parsing. An XML parser can parse an XML document with or without comparing it to an XML Schema. Parsing with schema validation consumes more CPU cycles than without. Hence, validation can have a performance impact, especially in CPU-bound environments.

Be aware of the difference between a valid and a well-formed XML document. A well-formed document does not have to comply with any particular XML Schema. A document is well-formed if the XML syntax of the document is correct. For example, all start tags must have a corresponding end tag, elements must be properly nested, attribute values must be in quotes, no reserved characters are used, and so on. A document is well-formed if it can be parsed by an XML parser without errors. If a document is not well-formed then it’s not considered an XML document. XML documents that are not well-formed cannot be processed and need to be corrected or discarded.
If you attempt to insert a non-well-formed document into an XML column in DB2, the document will be rejected with an error message that indicates why the document isn’t well-formed. The complete list of formal requirements for a document to be well-formed is given at http://www.w3.org/TR/xml.

A document is valid if it is well-formed and it complies with a particular XML Schema. Hence, validity is stronger than well-formedness. Every valid document is also a well-formed document. A document cannot be valid if it is not well-formed.

XML Schemas are often used to define agreed-upon ways to exchange information between organizations or between departments and applications within a single company. You can often view an XML Schema as a contract that says “if you give me data in this specific format, then I know how to deal with it.”

16.1.2 To Validate or Not to Validate, That Is the Question!

Whenever you insert, load, or update an XML document into a table you can choose to validate the document against an XML Schema. You can also choose to validate XML documents in queries. The decision whether to perform validation depends on various factors.

You may want to validate XML documents in DB2 if you receive the documents from an unreliable source and you need to ensure that the data that enters your database adheres to a specific schema. Validation is also a good way to ensure that XML documents in DB2 are still valid after they have been updated by an application program. If your applications expect documents that comply with a specific schema, then validation is important to avoid application errors.

You might prefer to avoid schema validation in DB2 if you receive XML documents from a trusted source. For example, if XML documents are inserted and updated by internal applications that have been well-tested, validation can often be avoided to reduce CPU consumption. Another common scenario is that XML documents are validated in other layers of your infrastructure, such as the application server, the enterprise service bus, or a message broker. If that already guarantees documents to be valid, then additional validation in DB2 might not be required.

16.1.3 Custom Versus Industry Standard XML Schemas

Where do XML Schemas come from? There are multiple answers to this question. You can certainly write your own XML Schema to constrain XML data according to the requirements of your application. In this context you might wonder whether the XML Schema should be defined by the DBAs or by the application designer. Clearly, the XML Schema needs to be defined to meet the application requirements and should be designed primarily by people with subject-matter knowledge of the application.
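The well-formedness rules listed above can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree as that parser; it is an illustration outside DB2, and the sample fragments and the helper name is_well_formed are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text can be parsed as XML without errors."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One correct fragment and one violation for each rule named above.
samples = {
    "correct": '<phone type="work">416-555-1358</phone>',
    "missing end tag": '<phone type="work">416-555-1358',
    "improper nesting": '<a><b></a></b>',
    "unquoted attribute": '<phone type=work>416-555-1358</phone>',
    "reserved character": '<name>Smith & Sons</name>',
}

for label, text in samples.items():
    print(f"{label:<20}{is_well_formed(text)}")
```

Only the first fragment parses. Note that this check says nothing about validity, because no XML Schema is involved.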
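To make the gap between well-formed and valid concrete, the following sketch checks a few rules of the kind an XML Schema could express, such as exactly one name element and phone values of the form xxx-xxx-xxxx. It is deliberately not a schema validator: the rules are hand-coded in Python as a stand-in for a real XML Schema, the function name check_customerinfo is hypothetical, and reading "x" as a digit is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each x in the form xxx-xxx-xxxx stands for one digit.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

def check_customerinfo(text):
    """Return a list of rule violations; raises ET.ParseError if not well-formed."""
    root = ET.fromstring(text)
    problems = []
    if root.tag != "customerinfo":
        problems.append("root element must be customerinfo")
    if len(root.findall("name")) != 1:
        problems.append("exactly one name element is required")
    for phone in root.findall("phone"):  # phone is optional and may repeat
        if not PHONE.fullmatch(phone.text or ""):
            problems.append(f"phone value {phone.text!r} is not of the form xxx-xxx-xxxx")
    return problems

good = ('<customerinfo><name>Kathy Smith</name>'
        '<phone>416-555-1358</phone></customerinfo>')
bad = '<customerinfo><phone>555-1358</phone></customerinfo>'

print(check_customerinfo(good))
print(check_customerinfo(bad))
```

Both sample documents are well-formed, but only the first one satisfies the rules. In DB2 this role is played by validation against a registered XML Schema rather than by application code like this.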
An XML Schema should not be defined in an attempt to optimize how documents are stored and processed in DB2. DB2 pureXML is designed to handle XML data for any XML Schema. Modeling business data with an XML Schema happens at the logical level and needs to focus on the business requirements of your application, not on how DB2 processes XML. Applications typically process business objects such as orders, tax returns, medical records, newspaper articles, insurance claims, patents, or others. Most applications work best if each individual business object is represented by a separate XML document, which often leads to a large number of small XML documents. Just like DB2 can handle relational tables with large numbers of rows efficiently, DB2 pureXML is well-suited to manage large collections of XML documents. Designing XML Schemas is best done with design tools such as IBM Data Studio, IBM Rational Application Developer, or Altova XMLSPY, which are described in Chapter 21, Developing XML Applications with DB2. Such tools also allow you to generate an XML Schema based on existing XML documents.
In many cases you won’t have the luxury to define your own XML format with an XML Schema. Other organizations or business partners might already have established a specific XML format that you are required to consume. Today, every major vertical industry has defined one or multiple XML Schemas to standardize the data and data exchange formats in that particular industry. Some of them are listed in Table 16.1.

Table 16.1 Industry Standard XML Schemas

    Name               Industry          Purpose/Comment
    -----------------  ----------------  ----------------------------------------------
    FpML               Financial         Derivatives trading
    FIXML              Financial         Securities
    UNIFI (ISO 20022)  Financial         SEPA (Single Euro Payments Area)
    SwiftXML           Financial         Financial messaging
    MISMO              Financial         Loans and mortgages
    Origo              Financial         Life insurance and pensions
    ACORD              Insurance         Document standard in the insurance sector
    HL7                Health Care       Document standard for medical and clinical data
    CDISC              Health Care       Clinical laboratory data
    ARTS               Retail            General retail
    STAR               Retail            Automotive retail
    NewsML             Media/Publishing  Creation, transfer, delivery of news
    DITA               Media/Publishing  Darwin Information Typing Architecture
    DOCBOOK            Media/Publishing  Document authoring
    SVG                Media/Publishing  Scalable Vector Graphics
    GJXDM              Government        Global Justice XML Data Model
    TAX1120            Government        IRS e-File Form 1120, for corporate tax
    NIEM               Government        National Information Exchange Model
    OTA                Travel            OpenTravel Alliance
    PIDX               Energy            Petroleum Industry Data Exchange
    OAGIS              Cross-Industry    Business object documents (BODs)
IBM has developed packages with sample scripts for a variety of these industry standard XML Schemas. These packages give you a jumpstart for registering these XML Schemas and validating sample documents. They are available at http://www.alphaworks.ibm.com/tech/purexml/download.

16.2 ANATOMY OF AN XML SCHEMA

An XML Schema can consist of one or multiple schema documents, each of which is an XML document with special characteristics. Let’s first look at the simple XML Schema customer.xsd in Figure 16.1, which consists of a single schema document. The XML instance document in Figure 16.2 is valid with respect to this XML Schema.

The first four lines of the XML Schema in Figure 16.1 contain the root element xs:schema and several namespace declarations. The declaration xmlns:xs="http://www.w3.org/2001/XMLSchema" binds the namespace prefix xs to the XML Schema URI. All elements in the schema, such as schema, complexType, element, attribute, and so on, are prefixed with xs: to indicate that they have a specific meaning, as defined in the XML Schema specification of the W3C (see link in Appendix C, Further Reading). This use of the XML Schema namespace and the use of the elements that belong to this namespace make the XML document in Figure 16.1 an XML Schema rather than a regular XML instance document. In fact, the namespace URI http://www.w3.org/2001/XMLSchema refers to the schema for XML Schemas, which defines what XML Schemas can look like.

The target namespace defined in the second line of Figure 16.1 indicates that all the elements declared in this XML Schema belong to a specific namespace; that is, the namespace with the URI http://pureXMLcookbook.org. XML documents that want to be compliant with this schema have to declare this namespace (see Figure 16.2 as an example).

The fourth line of the schema, elementFormDefault="qualified", mandates that not only globally but also locally declared elements need to be qualified with a namespace when they appear in an instance document. You’ll see in a minute what that means.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://pureXMLcookbook.org"
               xmlns="http://pureXMLcookbook.org"
               elementFormDefault="qualified" >

      <xs:complexType name="phoneType">
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="type" type="xs:string" use="required"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>

      <xs:complexType name="addrType">
        <xs:sequence>
          <xs:element name="street" type="xs:string"/>
          <xs:element name="city" type="xs:string"/>
          <xs:element name="prov-state" type="xs:string"/>
          <xs:element name="pcode-zip" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="country" type="xs:string"/>
      </xs:complexType>

      <xs:complexType name="assisType">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="phone" type="phoneType"
                      minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>

      <xs:element name="customerinfo">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="addr" type="addrType"/>
            <xs:element name="phone" type="phoneType"
                        minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="assistant" type="assisType"
                        minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute name="Cid" type="xs:integer" />
        </xs:complexType>
      </xs:element>

    </xs:schema>

Figure 16.1 The XML Schema customer.xsd

The body of the XML Schema in Figure 16.1 consists of four main blocks. The first three of them start with xs:complexType because they define complex data types that are used later in the schema.

The first of these complex types is called phoneType. The declaration of phoneType says that an element of this type will have simple content (no child elements). In this example, phone elements have a value of type xs:string and an attribute called type whose data type is also xs:string. This attribute is not optional. It’s required because it is declared with use="required".

The second complex type defines the addrType. This type declaration says that an element of type addrType has to contain a sequence of exactly four child XML elements. The child elements must have the names street, city, prov-state, and pcode-zip. All of them are of type xs:string, which means they can contain any text value. These four elements are declared as an xs:sequence, which means they have to occur in the specified order. The addrType further defines that an element of this type can have an optional attribute called country of type xs:string.

The elements street, city, prov-state, and pcode-zip are considered local elements, because their definition is local to the complex type addrType and not visible elsewhere. Although the addrType is globally declared, the elements inside it are locally declared. The declaration elementFormDefault="qualified" at the top of the schema requires these elements to be qualified by the target namespace when they appear in an instance document. This is true in Figure 16.2, which uses a default namespace that qualifies all elements in the document.

The third complex type is the assisType. It declares that an element of type assisType

• Must have a child element called name of type xs:string.
• Can have zero or more child elements called phone that must be of type phoneType, which was defined earlier.

The schema attribute minOccurs="0" indicates that the phone element is optional. If an element is declared without explicit minOccurs or maxOccurs indicators then the default value is 1, which means the element has to occur exactly once.

The fourth and last block in this schema defines that a document that wants to be compliant with this schema has to have a root element customerinfo. This customerinfo element is globally declared in the schema. It must have at least two child elements. They have to be name of type xs:string, and addr of type addrType, which were defined earlier. The customerinfo element can, optionally, also contain any number of phone and assistant elements, which have to be of type phoneType and assisType respectively. If they exist, all assistant elements have to appear after any phone elements, as mandated by the XML Schema construct xs:sequence. Finally, the customerinfo element can also contain the optional attribute Cid of type xs:integer.

Given the XML Schema shown in Figure 16.1, a valid XML document is shown in Figure 16.2.
    <customerinfo xmlns="http://pureXMLcookbook.org" Cid="1004">
      <name>Matt Foreman</name>
      <addr country="Canada">
        <street>1596 Baseline</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M3Z 5H9</pcode-zip>
      </addr>
      <phone type="work">905-555-4789</phone>
      <phone type="home">416-555-3376</phone>
      <assistant>
        <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>
      </assistant>
    </customerinfo>

Figure 16.2 A valid document for the XML Schema customer.xsd
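To experiment with these rules outside the database, the following minimal sketch validates small documents with the JDK's javax.xml.validation API. It is not how DB2 performs validation internally. The schema is an abbreviated form of customer.xsd (the addr and assistant parts are omitted to keep the example short), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Illustrative only: validates against an abbreviated customer.xsd with the JDK validator.
class ValidityCheck {

    // Abbreviated schema: phoneType plus a customerinfo root with name and phone.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='http://pureXMLcookbook.org'"
      + " xmlns='http://pureXMLcookbook.org'"
      + " elementFormDefault='qualified'>"
      + "<xs:complexType name='phoneType'><xs:simpleContent>"
      + "<xs:extension base='xs:string'>"
      + "<xs:attribute name='type' type='xs:string' use='required'/>"
      + "</xs:extension></xs:simpleContent></xs:complexType>"
      + "<xs:element name='customerinfo'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='phone' type='phoneType'"
      + " minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence>"
      + "<xs:attribute name='Cid' type='xs:integer'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Compiles the abbreviated schema; a broken schema is a programming error here. */
    static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the document is well-formed and valid for the schema above. */
    static boolean isValid(String xml) {
        try {
            compile().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or a schema constraint was violated
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String ns = "xmlns='http://pureXMLcookbook.org'";
        // Complies with the schema.
        System.out.println(isValid("<customerinfo " + ns + " Cid='1004'>"
            + "<name>Matt Foreman</name><phone type='work'>905-555-4789</phone>"
            + "</customerinfo>"));
        // The required attribute 'type' is missing on phone.
        System.out.println(isValid("<customerinfo " + ns + ">"
            + "<name>Matt Foreman</name><phone>905-555-4789</phone></customerinfo>"));
    }
}
```

A document that omits the namespace declaration fails as well, because its root element is then not the customerinfo element declared in the target namespace.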
16.3 AN XML SCHEMA WITH INCLUDE AND IMPORT

In many real-world applications, XML data and hence XML Schemas are a lot more complex than the one shown in Figure 16.1. To make the design and handling of complex XML Schemas easier, it is often desirable to divide their content among several schema documents. This approach is similar to application programs whose source code is divided across a number of distinct files, or modules, which can be included into other files as needed to build more complex applications. In the same manner, an XML Schema can consist of multiple schema documents.

As an example, let’s take the schema in Figure 16.1 and move the definition of phone elements and addresses into separate schema documents, phone.xsd and addr.xsd. Figure 16.3 shows the schema document phone.xsd, which only declares the complex type phoneType. This schema document declares a global type but does not define any global elements and therefore cannot be used by itself for any validation. It can only serve as a module that is used in other schemas.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="phoneType">
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="type" type="xs:string" use="required"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:schema>

Figure 16.3 Content of the schema document phone.xsd

The schema document addr.xsd in Figure 16.4 is conceptually different from phone.xsd. One difference is that it declares a global element (addr) and not just a complex type. Therefore, this schema document can be used by itself to validate XML documents. For example, the document in Figure 16.5 is a valid instance document for addr.xsd.

Another notable property of the schema document in Figure 16.4 is that it defines its own target namespace. You will see shortly that this makes a difference when this schema document is used as a module in a larger schema.
Also note that the XML document in Figure 16.5 declares the default namespace http://pureXMLcookbook.org/addr for all its elements. This namespace is the target namespace defined in addr.xsd and required for the document to be valid for this schema. Remember that attributes never belong to a default namespace.
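The remark about attributes can be checked with any namespace-aware parser. The following is a small sketch that uses the JDK's DOM parser on a shortened form of the addr document; the class and method names are illustrative and not part of DB2. It reports the namespace of the addr element and of its country attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative only: shows which nodes pick up a default namespace declaration.
class DefaultNamespaceDemo {

    // Shortened form of the addr document (only the city child is kept).
    static final String DOC =
        "<addr xmlns='http://pureXMLcookbook.org/addr' country='Canada'>"
      + "<city>Toronto</city></addr>";

    /** Parses the text with namespace awareness and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(DOC);
        // The element and its child inherit the default namespace.
        System.out.println("addr element:      " + root.getNamespaceURI());
        System.out.println("city element:      " + root.getFirstChild().getNamespaceURI());
        // The unprefixed attribute is in no namespace at all.
        System.out.println("country attribute: "
            + root.getAttributeNode("country").getNamespaceURI());
    }
}
```

The attribute reports no namespace even though its parent element is in the default namespace, which is why the schema declares country without any namespace qualification.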
    <xs:schema targetNamespace="http://pureXMLcookbook.org/addr"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns="http://pureXMLcookbook.org/addr"
               elementFormDefault="qualified">
      <xs:element name="addr">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="street" type="xs:string"/>
            <xs:element name="city" type="xs:string"/>
            <xs:element name="prov-state" type="xs:string"/>
            <xs:element name="pcode-zip" type="xs:string"/>
          </xs:sequence>
          <xs:attribute name="country" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:schema>

Figure 16.4 Content of the schema document addr.xsd

A valid document for the XML Schema in Figure 16.4 is shown in Figure 16.5.

    <addr xmlns="http://pureXMLcookbook.org/addr" country="Canada">
      <street>1 Young Street</street>
      <city>Toronto</city>
      <prov-state>Ontario</prov-state>
      <pcode-zip>M5W-IE6</pcode-zip>
    </addr>

Figure 16.5 A valid document for addr.xsd

You can now write a schema document that uses both phone.xsd and addr.xsd as building blocks. Such a schema document is customer2.xsd in Figure 16.6. The line

    <xs:include schemaLocation="phone.xsd"/>

includes the schema document phone.xsd. This xs:include element makes all definitions and declarations in phone.xsd locally available in customer2.xsd. Subsequently, the phoneType can be used as if it were defined locally. All elements in the included schema document (phone.xsd) automatically take on the namespace of the including schema document (customer2.xsd). Therefore the include mechanism can only be used to pull in a schema document that does not define a target namespace or whose target namespace is the same as the target namespace in the including schema.

The xs:import element allows you to pull in a schema document that defines a target namespace that is different from the target namespace in the including schema.
The import mechanism enables schema components from different target namespaces to be used together, and hence enables the validation of XML documents that combine structures from multiple namespaces. In the XML Schema literature you will find further subtle differences between xs:include and xs:import. For example, strictly speaking xs:include includes a schema document, but xs:import imports a namespace.
In Figure 16.6, customer2.xsd uses xs:import to make locally available the definition of the addr element from the namespace http://pureXMLcookbook.org/addr. Since this is a new namespace, customer2.xsd assigns the prefix address to it, so that objects from this namespace can be properly qualified. This prefix is used further down in the schema document where the line

    <xs:element ref="address:addr"/>

references the addr element that is globally defined in addr.xsd.

We call the schema document customer2.xsd the primary schema document because it is at the top of the hierarchy of the include and import dependencies among several schema documents. Note that the included and imported schema documents themselves can also include or import other schema documents. Include dependencies within any given namespace must not be circular. Import relationships between namespaces are allowed to be circular.

    <xs:schema targetNamespace="http://pureXMLcookbook.org"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns="http://pureXMLcookbook.org"
               xmlns:address="http://pureXMLcookbook.org/addr"
               elementFormDefault="qualified">

      <xs:include schemaLocation="phone.xsd"/>
      <xs:import namespace="http://pureXMLcookbook.org/addr"
                 schemaLocation="addr.xsd"/>

      <xs:complexType name="assisType">
        <xs:sequence>
          <xs:element name="name" type="xs:string" minOccurs="0" />
          <xs:element name="phone" type="phoneType"
                      minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>

      <xs:element name="customerinfo">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element ref="address:addr"/>
            <xs:element name="phone" type="phoneType"
                        minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="assistant" type="assisType"
                        minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute name="Cid" type="xs:integer" />
        </xs:complexType>
      </xs:element>

    </xs:schema>

Figure 16.6 Schema document customer2.xsd refers to other schema documents.
The relationship between the schema documents customer2.xsd, phone.xsd, and addr.xsd is illustrated in Figure 16.7.
[Figure 16.7 Multiple XML Schema documents comprise one XML Schema. The diagram shows customer2.xsd, which includes phone.xsd and imports addr.xsd.]

The XML document in Figure 16.8 is valid with respect to the XML Schema customer2.xsd. The document declares http://pureXMLcookbook.org as the default element namespace. Since the addr element comes from a different namespace, the document also declares the namespace http://pureXMLcookbook.org/addr and assigns it the prefix address. This prefix is used for the addr element and all its children to override the default namespace. Since the schema combines declarations from two different namespaces, the same must be reflected in the instance document that is valid for this schema.

The schema document phone.xsd did not define its own namespace, and therefore the phone elements in the instance document belong to the default namespace. This namespace is the target namespace of the schema document customer2.xsd, which includes phone.xsd.
    <customerinfo xmlns="http://pureXMLcookbook.org" Cid="1004"
                  xmlns:address="http://pureXMLcookbook.org/addr">
      <name>Matt Foreman</name>
      <address:addr country="Canada">
        <address:street>1596 Baseline</address:street>
        <address:city>Toronto</address:city>
        <address:prov-state>Ontario</address:prov-state>
        <address:pcode-zip>M3Z 5H9</address:pcode-zip>
      </address:addr>
      <phone type="work">905-555-4789</phone>
      <phone type="home">416-555-3376</phone>
      <assistant>
        <name>Gopher Runner</name>
        <phone type="home">416-555-3426</phone>
      </assistant>
    </customerinfo>

Figure 16.8 A valid document for the XML Schema customer2.xsd

If you are designing your own XML Schema, you will find yourself confronted with many design options; for example, when and how to use namespaces and how many, whether to declare global or local elements, global or local types, how to keep schemas extensible, how to version XML Schema documents, and other trade-offs. We chose not to cover these topics in this book, but refer you to the XML Schema best practices at the following URL: http://www.xfront.com/BestPracticesHomepage.html.
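The include and import mechanics described in this section can be tried with the JDK's own schema processor. The following is a rough sketch, not a description of what DB2 does internally: it writes abbreviated versions of phone.xsd, addr.xsd, and customer2.xsd into a temporary directory (an assumption made so that the relative schemaLocation values can be resolved), compiles the primary schema document, and validates two small documents. All class, method, and constant names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Illustrative only: one XML Schema assembled from three schema documents.
class IncludeImportDemo {
    static final String XS = "xmlns:xs='http://www.w3.org/2001/XMLSchema'";
    static final String NS = "http://pureXMLcookbook.org";

    // No target namespace: takes on the namespace of the including document.
    static final String PHONE_XSD = "<xs:schema " + XS + ">"
        + "<xs:complexType name='phoneType'><xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='type' type='xs:string' use='required'/>"
        + "</xs:extension></xs:simpleContent></xs:complexType></xs:schema>";

    // Own target namespace: must be imported, not included (abbreviated to city).
    static final String ADDR_XSD = "<xs:schema " + XS
        + " targetNamespace='" + NS + "/addr' elementFormDefault='qualified'>"
        + "<xs:element name='addr'><xs:complexType><xs:sequence>"
        + "<xs:element name='city' type='xs:string'/></xs:sequence>"
        + "<xs:attribute name='country' type='xs:string'/>"
        + "</xs:complexType></xs:element></xs:schema>";

    // Primary schema document (abbreviated: no assistant element).
    static final String CUSTOMER2_XSD = "<xs:schema " + XS
        + " targetNamespace='" + NS + "' xmlns='" + NS + "'"
        + " xmlns:address='" + NS + "/addr' elementFormDefault='qualified'>"
        + "<xs:include schemaLocation='phone.xsd'/>"
        + "<xs:import namespace='" + NS + "/addr' schemaLocation='addr.xsd'/>"
        + "<xs:element name='customerinfo'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:element ref='address:addr'/>"
        + "<xs:element name='phone' type='phoneType'"
        + " minOccurs='0' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static final String VALID_DOC = "<customerinfo xmlns='" + NS + "'"
        + " xmlns:address='" + NS + "/addr'><name>Matt Foreman</name>"
        + "<address:addr country='Canada'><address:city>Toronto</address:city>"
        + "</address:addr><phone type='work'>905-555-4789</phone></customerinfo>";

    // Same content, but addr and city are left in the default namespace.
    static final String INVALID_DOC = "<customerinfo xmlns='" + NS + "'>"
        + "<name>Matt Foreman</name><addr country='Canada'><city>Toronto</city>"
        + "</addr><phone type='work'>905-555-4789</phone></customerinfo>";

    /** Writes the three documents to a temp directory and compiles the primary one. */
    static Schema loadSchema() {
        try {
            Path dir = Files.createTempDirectory("xsd-demo"); // left behind; fine for a demo
            Files.write(dir.resolve("phone.xsd"), PHONE_XSD.getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("addr.xsd"), ADDR_XSD.getBytes(StandardCharsets.UTF_8));
            Path primary = dir.resolve("customer2.xsd");
            Files.write(primary, CUSTOMER2_XSD.getBytes(StandardCharsets.UTF_8));
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(primary.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = loadSchema();
        System.out.println(isValid(schema, VALID_DOC));
        System.out.println(isValid(schema, INVALID_DOC));
    }
}
```

The second document fails because its addr element sits in the default namespace http://pureXMLcookbook.org, whereas the schema references the addr element from the imported namespace.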
16.4 REGISTERING XML SCHEMAS

Before you can use XML Schemas in a DB2 database to validate XML documents, you need to register them in the XML Schema Repository (XSR). The XSR consists of several tables in the DB2 catalog. Registering an XML Schema means that the schema as well as meta information about the schema are inserted into these catalog tables.

Registering XML Schemas ensures fast and reliable access to the schemas for document validation. During schema registration, each XML Schema is parsed and stored in a binary representation in the DB2 catalog. When the schema is later used for document validation, it does not have to be parsed again, which is a significant performance benefit. DB2 does not support document validation with XML Schemas that are located in the file system or at some URL on the web; all XML Schemas must be registered with DB2.

Once an XML Schema is registered in the XSR, the schema is also called an XSR Object. The XSR catalog tables and views in DB2 for z/OS and DB2 for Linux, UNIX, and Windows are described in detail in section 16.9.

Since an XML Schema can generally consist of multiple schema documents, the registration process takes the following steps:

1. Register the primary schema document and assign a unique identifier for the XML Schema.
2. Add additional schema documents that are included or imported. The same schema identifier is used in this step to indicate that these additional schema documents belong to the primary schema document that was registered in step 1. The additional schema documents can be added in any order, independent from the import and include dependencies that exist among them.
3. Complete the schema registration. In this step DB2 verifies the correctness of the schema and checks whether all schema documents that are referenced in xs:include or xs:import declarations have been added.
If an XML Schema consists of just a single schema document, then these three steps can be collapsed into a single command.

An XML Schema is typically given two kinds of identifiers at schema registration time:

• A relational SQL identifier, for example db2admin.custxsd
• A schema location URI, which can be any arbitrary string of 1000 bytes or less, such as myschema\customer.xsd or http://pureXMLcookbook.org/customer.xsd

The schema location URI does not have to reflect the actual location or filename of the XML Schema. It is up to you to choose a schema location URI that provides a meaningful indication of the identity and/or location of the XML Schema.
In section 17.1, Document Validation Upon Insert, you will see that you can reference an XML Schema in one of the following ways in order to use it for the validation of an XML document:

• By the relational SQL identifier of the XML Schema, which must be unique
• By the schema location URI provided when the schema was registered, if this URI is unique in the XSR
• By the target namespace in the primary schema document, if this namespace is unique in the XSR
• By the combination of target namespace and schema location URI, if this pair of values is unique in the XSR

Referencing XML Schemas by their SQL identifiers is recommended in most cases, because these identifiers are always unique and for most users they are an intuitive way of referring to database objects. The use of namespaces and schema locations can be useful if you want to allow XML documents to use schema location hints to dynamically select XML Schemas for validation (see Chapter 17, Validating XML Documents against XML Schemas).

Depending on the complexity of your XML Schemas, you might need to increase the application heap size (applheapsz) of your DB2 for Linux, UNIX, and Windows database. To register very complex XML Schemas on 32-bit Windows systems, the DB2 agent stack size (agent_stack_sz) might also need to be increased.

You can register and manage XML Schemas with commands in the DB2 Command Line Processor (CLP) or with equivalent stored procedures from an application. Both methods are available in DB2 for z/OS and in DB2 for Linux, UNIX, and Windows. Note that DB2 for z/OS offers a Command Line Processor through UNIX System Services (USS).

16.4.1 Registering XML Schemas in the DB2 Command Line Processor

Let’s first look at registering the XML Schema in Figure 16.1, which consists of just a single XML Schema document. After that we show how to register an XML Schema that consists of multiple schema documents, like the one in Figure 16.6.
Figure 16.9 shows a single REGISTER XMLSCHEMA command that registers the XML Schema customer.xsd from Figure 16.1 in the DB2 XML Schema Repository. Let’s look at this command line by line:

1. The first line of the command specifies the schema location URI of the schema, which is specified as the string customer.xsd. We choose this value because it can be helpful to identify this schema among others. But, any other string value could be used here.
2. The second line specifies the directory (c:\xml\myschemas) and filename (customer.xsd) of the actual schema document that is being registered.
3. The third line assigns a relational SQL identifier to the schema. This identifier is a two-part name consisting of a relational schema name (db2admin) and an identifier for this specific XML Schema (custxsd). In DB2 for z/OS, the relational schema name must be SYSXSR or default to SYSXSR. If the relational schema name is omitted then it defaults to the CURRENT SQLID, which must be SYSXSR in this case.
4. The fourth line indicates that this completes the registration and no further schema documents are required.

    REGISTER XMLSCHEMA 'customer.xsd'
      FROM 'FILE:c:\xml\myschemas\customer.xsd'
      AS db2admin.custxsd
      COMPLETE;

Figure 16.9 Registering an XML Schema that consists of one schema document

If the XML Schema is made up of more than one XML Schema document, such as customer2.xsd in Figure 16.6, then the schema registration process consists of multiple steps. First you need to register the primary XML Schema document using the REGISTER XMLSCHEMA command, as shown in Figure 16.10. The second step adds the schema document phone.xsd. The third step adds the schema document addr.xsd. The fourth and last step completes the registration. The COMPLETE command checks that all links in the include and import declarations are resolvable, and that the total XML Schema document is consistent. After completing the schema registration you might want to grant the USAGE privilege to PUBLIC (see section 16.7).
    REGISTER XMLSCHEMA 'customer2.xsd'
      FROM 'FILE:c:\xml\myschemas\customer2.xsd'
      AS db2admin.custxsd2;

    ADD XMLSCHEMA DOCUMENT TO db2admin.custxsd2
      ADD 'phone.xsd'
      FROM 'FILE:c:\xml\myschemas\phone.xsd';

    ADD XMLSCHEMA DOCUMENT TO db2admin.custxsd2
      ADD 'addr.xsd'
      FROM 'FILE:c:\xml\myschemas\addr.xsd';

    COMPLETE XMLSCHEMA db2admin.custxsd2;

Figure 16.10 Registering an XML Schema that consists of multiple schema documents

Note that the clause ADD 'phone.xsd' in the ADD XMLSCHEMA command does not refer to the filename of the schema document on disk, but to how the schema document is referenced in the including schema. Since the including schema customer2.xsd contains the line

    <xs:include schemaLocation="phone.xsd"/>

you have to use 'phone.xsd' in the ADD clause here. If customer2.xsd were different and contained the line

    <xs:include schemaLocation="http://PhoneSchema"/>

then you would use ADD 'http://PhoneSchema' instead of ADD 'phone.xsd' in the ADD XMLSCHEMA command. This handling of schema location values is independent of the actual filename, which is specified separately in the FROM clause.

16.4.2 Registering XML Schemas from Applications via Stored Procedures

The XML Schema registration that was performed with CLP commands in the previous section can also be achieved with stored procedure calls from an application. For each of the three commands that were used in the previous section there is a corresponding stored procedure (see Table 16.2). In DB2 for z/OS, these stored procedures run in a WLM stored procedure address space.

Table 16.2 XSR Commands and Stored Procedures to Register XML Schemas

    Command Line Processor Commands   Stored Procedure
    --------------------------------  -------------------------
    REGISTER XMLSCHEMA                SYSPROC.XSR_REGISTER
    ADD XMLSCHEMA DOCUMENT            SYSPROC.XSR_ADDSCHEMADOC
    COMPLETE XMLSCHEMA                SYSPROC.XSR_COMPLETE

The two stored procedure calls in Figure 16.11 register the XML Schema customer.xsd from Figure 16.1. The first call invokes the XSR_REGISTER procedure and passes five parameters: the relational schema name (db2admin), the schema identifier (custxsd), the schema location URI (customer.xsd), and a host variable of type BLOB(30M), which contains the actual schema document that you want to register. A fifth parameter can optionally be used to provide an XML document with descriptive information about this schema document, but the example simply passes NULL instead, to not make use of this option.

The second stored procedure call (CALL XSR_COMPLETE) completes the registration and expects the relational schema name and the schema identifier in its first and second parameter, respectively. The third parameter allows for an optional document with meta information about the schema. For example, you can use this parameter if you want to store a description of your schema along with the schema itself. DB2 makes no use of this metadata but allows your application to provide and refer to this information if desired. The fourth parameter can either be 0 or 1, depending on whether the schema will be used for validation only (0), or also for decomposition (1). Decomposition is discussed in Chapter 11, Converting XML to Relational Data.
    CALL XSR_REGISTER('db2admin', 'custxsd', 'customer.xsd',
                      :primarySchemaDocument, NULL);

    CALL XSR_COMPLETE('db2admin', 'custxsd', NULL, 0);

Figure 16.11 Registering an XML Schema that consists of one schema document

NOTE If you use these stored procedures in DB2 for z/OS then the relational schema name, if provided, has to be SYSXSR. If omitted, it will default to SYSXSR.

If an XML Schema consists of multiple schema documents, then one or multiple invocations of the procedure XSR_ADDSCHEMADOC should be used to add additional schema documents before calling the procedure XSR_COMPLETE. Figure 16.12 shows two calls to XSR_ADDSCHEMADOC to add the schema documents phone.xsd and addr.xsd to the schema, similar to what was discussed for Figure 16.10. As with the ADD clause of the CLP command, the third parameter of XSR_ADDSCHEMADOC is the schema location by which the primary schema document references the added document.

    CALL XSR_REGISTER('db2admin', 'custxsd2', 'customer2.xsd',
                      :primarySchemaDocument, NULL);

    CALL XSR_ADDSCHEMADOC('db2admin', 'custxsd2', 'phone.xsd',
                          :SchemaDocumentPhone, NULL);

    CALL XSR_ADDSCHEMADOC('db2admin', 'custxsd2', 'addr.xsd',
                          :SchemaDocumentAddr, NULL);

    CALL XSR_COMPLETE('db2admin', 'custxsd2', NULL, 0);

Figure 16.12 Registering an XML Schema that consists of multiple schema documents
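From a Java application the same sequence of calls could be issued through standard JDBC. The following is only a sketch under several assumptions: the caller supplies an open java.sql.Connection to DB2, the schema documents are already loaded into byte arrays, and the first two parameters of XSR_REGISTER are treated as INOUT parameters (check the stored procedure reference for your platform and version). The class and method names are hypothetical.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;
import java.util.Map;

// Hypothetical helper: issues the three XSR stored procedure calls via plain JDBC.
class XsrRegistration {
    static final String REGISTER = "CALL SYSPROC.XSR_REGISTER(?, ?, ?, ?, ?)";
    static final String ADD_DOC  = "CALL SYSPROC.XSR_ADDSCHEMADOC(?, ?, ?, ?, ?)";
    static final String COMPLETE = "CALL SYSPROC.XSR_COMPLETE(?, ?, ?, ?)";

    /**
     * Registers a primary schema document plus any included or imported documents.
     * The keys of otherDocs are the schema locations used in xs:include / xs:import.
     */
    static void registerSchema(Connection con, String relSchema, String name,
                               String primaryLocation, byte[] primaryDoc,
                               Map<String, byte[]> otherDocs) throws SQLException {
        // Step 1: register the primary schema document.
        try (CallableStatement reg = con.prepareCall(REGISTER)) {
            reg.setString(1, relSchema);
            reg.setString(2, name);
            // Assumption: parameters 1 and 2 are INOUT, so they are registered as outputs too.
            reg.registerOutParameter(1, Types.VARCHAR);
            reg.registerOutParameter(2, Types.VARCHAR);
            reg.setString(3, primaryLocation);
            reg.setBytes(4, primaryDoc);
            reg.setNull(5, Types.BLOB); // no descriptive properties document
            reg.execute();
        }
        // Step 2: add the remaining schema documents, in any order.
        try (CallableStatement add = con.prepareCall(ADD_DOC)) {
            for (Map.Entry<String, byte[]> doc : otherDocs.entrySet()) {
                add.setString(1, relSchema);
                add.setString(2, name);
                add.setString(3, doc.getKey());
                add.setBytes(4, doc.getValue());
                add.setNull(5, Types.BLOB);
                add.execute();
            }
        }
        // Step 3: complete the registration (0 = validation only, no decomposition).
        try (CallableStatement done = con.prepareCall(COMPLETE)) {
            done.setString(1, relSchema);
            done.setString(2, name);
            done.setNull(3, Types.BLOB);
            done.setInt(4, 0);
            done.execute();
        }
    }
}
```

For the schema of Figure 16.12 the map would hold the two entries phone.xsd and addr.xsd, and primaryLocation would be customer2.xsd.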
16.4.3 Registering XML Schemas from Java Applications via JDBC

The JDBC driver for DB2 for z/OS and DB2 for Linux, UNIX, and Windows provides the method connection.registerDB2XmlSchema, which can be called from a Java application to register an XML Schema. The major difference from the commands and stored procedures discussed in the previous sections is how schemas are registered that consist of multiple schema documents. The registerDB2XmlSchema method can take an array of schema documents as input so that all components of a complex XML Schema are registered in a single call. This means there is no need for separate “ADDSCHEMADOC” calls for each component of a multidocument schema.

There are two forms of the registerDB2XmlSchema method: one that takes XML Schema documents as input from InputStream objects, and one that takes XML Schema documents as a String. The sample snippet of Java code in Figure 16.13 illustrates the use of the registerDB2XmlSchema method with input streams. It performs the same schema registration as previously shown in Figure 16.10 and Figure 16.12.
    import java.io.FileInputStream;
    import java.io.InputStream;

    // 'connection' is assumed to be an open com.ibm.db2.jcc.DB2Connection.
    String RelSchema = "SYSXSR";
    String SchemaIdentifier = "CUSTXSD2";

    // Schema locations as referenced in the xs:include and xs:import declarations
    String[] xmlSchemaLocations = new String[] {
        "customer2.xsd",
        "phone.xsd",
        "addr.xsd"
    };

    // The schema documents, in the same order (backslashes escaped for Java)
    FileInputStream[] xmlSchemaDocuments = new FileInputStream[] {
        new FileInputStream("c:\\xml\\myschemas\\customer2.xsd"),
        new FileInputStream("c:\\xml\\myschemas\\phone.xsd"),
        new FileInputStream("c:\\xml\\myschemas\\addr.xsd")
    };

    int[] xmlSchemaDocumentsLengths = new int[] {
        (int)xmlSchemaDocuments[0].getChannel().size(),
        (int)xmlSchemaDocuments[1].getChannel().size(),
        (int)xmlSchemaDocuments[2].getChannel().size()
    };

    // No optional property documents are provided
    InputStream[] xmlSchemaDocumentsProperties = null;
    int[] xmlSchemaDocumentsPropertiesLengths = null;
    InputStream xmlSchemaProperties = null;
    int xmlSchemaPropertiesLength = 0;
    boolean isUsedForShredding = false;

    connection.registerDB2XmlSchema(
        RelSchema,
        SchemaIdentifier,
        xmlSchemaLocations,
        xmlSchemaDocuments,
        xmlSchemaDocumentsLengths,
        xmlSchemaDocumentsProperties,
        xmlSchemaDocumentsPropertiesLengths,
        xmlSchemaProperties,
        xmlSchemaPropertiesLength,
        isUsedForShredding);

Figure 16.13 Registering an XML Schema via JDBC
16.4.4 Two XML Schemas Sharing a Common Schema Document

The import and include mechanisms of XML Schema allow you to build schemas in a modular fashion. Section 16.3 described an XML Schema that consists of three schema documents. The primary schema document customer2.xsd referenced the schema documents phone.xsd and addr.xsd to reuse existing definitions for phone numbers and addresses.

The schema documents phone.xsd and addr.xsd can also be used in other XML Schemas. For example, you can have a primary schema document supplier.xsd to define the structure of XML documents that contain supplier information. If you want supplier addresses to obey the same address structure as previously defined in addr.xsd, you can import addr.xsd into the schema supplier.xsd. This works just like you previously imported addr.xsd into customer2.xsd in Figure 16.6. As a result, customer2.xsd and supplier.xsd now both rely on addr.xsd as a schema component. They share a common schema document.

When you register the XML Schema documents customer2.xsd and supplier.xsd, you need to add the schema document addr.xsd twice (see Figure 16.14), once to participate in the customer schema, and once to participate in the supplier schema. Hence, the XML Schema Repository now contains two copies of the schema document addr.xsd. Although customer2.xsd and supplier.xsd logically share addr.xsd as a common schema document, DB2 requires separate physical copies of the shared document. Registering separate copies of addr.xsd is a good thing, because it allows you to drop, update, or version the customer and the supplier schemas independently from each other, if you have to.
REGISTER XMLSCHEMA 'customer2.xsd'
  FROM 'FILE:c:\xml\myschemas\customer2.xsd' AS db2admin.custxsd2;
ADD XMLSCHEMA DOCUMENT TO db2admin.custxsd2
  ADD 'phone.xsd' FROM 'FILE:c:\xml\myschemas\phone.xsd';
ADD XMLSCHEMA DOCUMENT TO db2admin.custxsd2
  ADD 'addr.xsd' FROM 'FILE:c:\xml\myschemas\addr.xsd';
COMPLETE XMLSCHEMA db2admin.custxsd2;

REGISTER XMLSCHEMA 'supplier.xsd'
  FROM 'FILE:c:\xml\myschemas\supplier.xsd' AS db2admin.suppxsd;
ADD XMLSCHEMA DOCUMENT TO db2admin.suppxsd
  ADD 'addr.xsd' FROM 'FILE:c:\xml\myschemas\addr.xsd';
COMPLETE XMLSCHEMA db2admin.suppxsd;

Figure 16.14 Two XML Schemas sharing a common schema document (addr.xsd)
The dependencies between schemas and their included or imported schema documents are recorded in the catalog view SYSCAT.XSROBJECTHIERARCHIES in DB2 for Linux, UNIX, and Windows, and the catalog table SYSIBM.XSROBJECTHIERARCHIES in DB2 for z/OS. The result of the query in Figure 16.15 reveals that both the customer and the supplier schemas depend on the schema document whose schema location is addr.xsd. The column HTYPE represents the type of hierarchy, where P indicates the primary schema document and D flags other (non-primary) schema documents that belong to the schema.

SELECT SUBSTR(o.objectname,1,25) AS schema,
       h.htype,
       SUBSTR(h.schemalocation,1,35) AS schema_component
FROM syscat.xsrobjecthierarchies h, syscat.xsrobjects o
WHERE h.objectid = o.objectid;

SCHEMA     HTYPE  SCHEMA_COMPONENT
---------  -----  ----------------
CUSTXSD2   P      customer2.xsd
CUSTXSD2   D      phone.xsd
CUSTXSD2   D      addr.xsd
SUPPXSD    P      supplier.xsd
SUPPXSD    D      addr.xsd

  5 record(s) selected.

Figure 16.15 Dependencies between schemas and their components (DB2 for LUW)

DB2 for z/OS uses tables instead of catalog views, as summarized in section 16.9. Hence, the query in Figure 16.15 would be written for DB2 for z/OS as shown in Figure 16.16. The differences are the schema qualifier SYSIBM and the join column XSROBJECTID.

SELECT SUBSTR(o.objectname,1,25) AS schema,
       h.htype,
       SUBSTR(h.schemalocation,1,35) AS schema_component
FROM sysibm.xsrobjecthierarchies h, sysibm.xsrobjects o
WHERE h.xsrobjectid = o.xsrobjectid;

Figure 16.16 Dependencies between schemas and their components (DB2 for z/OS)

16.4.5 Error Situations and How to Resolve Them

When you register an XML Schema, DB2 parses the schema and verifies that it complies with the XML Schema standard. If a schema consists of multiple schema documents, this verification happens in the COMPLETE step of the registration process.
In this step, all import and include dependencies are verified and the information from all schema documents is compiled into a single binary XML Schema grammar that allows fast document validation at runtime. DB2 reports appropriate errors if any of the schema documents are not well-formed or if they don’t form a correct XML Schema. In this case it is recommended to use an XML Schema editor,
such as the ones in IBM Data Studio Developer or Altova XMLSPY, to review and correct the schema.

Let’s briefly look at two specific errors that you might find difficult to resolve if an XML Schema consists of a large number of schema documents. First, consider SQL error SQL20329N, which can occur in the COMPLETE step of the registration process:

SQL20329N The completion check for the XML schema failed because one or more XML schema documents is missing. One missing XML schema document is identified by "LOCATION" as "assistant.xsd". SQLSTATE=428GI

This error happens when a schema document with schema location URI assistant.xsd cannot be found in the XML Schema Repository, but another schema document tries to include it with an xs:include specification such as the following:

<xs:include schemaLocation="assistant.xsd"/>

A typical reason for this error is that the missing schema document was added to the XSR, using the ADD XMLSCHEMA DOCUMENT command or the XSR_ADDSCHEMADOC stored procedure, but a schema location URI other than assistant.xsd was specified. To avoid this error, make sure that each schema document is always registered with the same schema location value that other schema documents use to refer to it in xs:include elements.

Next, consider SQL error SQL20340N, which can also occur in the COMPLETE step of a schema registration:

SQL20340N The XML schema "DB2ADMIN.CUSTOMER2" includes at least one XML schema document in namespace "http://pureXMLcookbook.org" with component ID "58546795155936256" that is not connected to the other XML schema documents in the same namespace using an include or redefine. SQLSTATE=22534

This error can arise when two or more schema documents in your XML Schema declare the same target namespace, which is http://pureXMLcookbook.org in this example. All of those schema documents must be connected via xs:include specifications.
It can be helpful to visualize all xs:include dependencies as a graph on a piece of paper, with a node for each schema document and an arrow for each xs:include that links two schema documents. In this graph, all the nodes that represent schema documents with the same target namespace declaration must be connected. Otherwise error SQL20340N is raised. Use the big integer component ID shown in the error message to identify the disconnected schema document using a query such as the following:

SELECT schemalocation, targetnamespace
FROM syscat.xsrobjectcomponents
WHERE componentid = 58546795155936256;
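The paper-and-pencil graph check described above can also be sketched in a few lines of Java. This is only an illustration of the idea, not DB2's own algorithm: the class name IncludeGraphCheck, the method groupsByNamespace, and the sample file names are hypothetical, and the sketch assumes that the direction of an xs:include does not matter for connectivity. You supply the target namespace of each schema document and the list of xs:include links, and the sketch groups the documents of each namespace into connected sets.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of DB2: models the xs:include graph in memory.
final class IncludeGraphCheck {

    /**
     * namespaceOf maps a schema location to its target namespace.
     * includes holds {from, to} pairs, one per xs:include.
     * Returns, for each namespace, the sets of documents that are connected
     * to each other. Assumption: include links are treated as undirected.
     */
    static Map<String, List<Set<String>>> groupsByNamespace(
            Map<String, String> namespaceOf, List<String[]> includes) {
        Map<String, Set<String>> neighbors = new HashMap<>();
        for (String doc : namespaceOf.keySet()) {
            neighbors.put(doc, new HashSet<>());
        }
        for (String[] edge : includes) {
            String from = edge[0];
            String to = edge[1];
            boolean known = neighbors.containsKey(from) && neighbors.containsKey(to);
            // Only links between documents of the same namespace count here
            if (known && namespaceOf.get(from).equals(namespaceOf.get(to))) {
                neighbors.get(from).add(to);
                neighbors.get(to).add(from);
            }
        }
        Map<String, List<Set<String>>> groups = new TreeMap<>();
        Set<String> visited = new HashSet<>();
        for (String start : new TreeSet<>(namespaceOf.keySet())) {
            if (!visited.add(start)) {
                continue; // already assigned to a group
            }
            // Breadth-first search collects everything reachable from start
            Set<String> group = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                String doc = queue.poll();
                group.add(doc);
                for (String next : neighbors.get(doc)) {
                    if (visited.add(next)) {
                        queue.add(next);
                    }
                }
            }
            groups.computeIfAbsent(namespaceOf.get(start), k -> new ArrayList<>()).add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: assistant.xsd is not included by anyone
        Map<String, String> ns = new HashMap<>();
        ns.put("customer.xsd", "http://pureXMLcookbook.org");
        ns.put("addr.xsd", "http://pureXMLcookbook.org");
        ns.put("assistant.xsd", "http://pureXMLcookbook.org");
        List<String[]> includes = new ArrayList<>();
        includes.add(new String[] {"customer.xsd", "addr.xsd"});
        groupsByNamespace(ns, includes)
            .forEach((namespace, g) -> System.out.println(namespace + " -> " + g));
    }
}
```

A namespace that ends up with more than one group contains at least one schema document that is not linked to the others, which is the situation that the text associates with error SQL20340N.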
16.5 REMOVING XML SCHEMAS FROM THE SCHEMA REPOSITORY

In DB2 for Linux, UNIX, and Windows, you can remove XML Schemas, DTDs, and external entities from the XML Schema Repository using the DROP XSROBJECT command. For example, to drop the XML Schema db2admin.custxsd, issue the command:

DROP XSROBJECT db2admin.custxsd

If you drop an XML Schema, then all schema documents that belong to that schema are removed from the XSR in a cascading manner. Any check constraints that reference the schema are also dropped. If you have any triggers, views, or packages that reference the object, then these are marked as inoperative or invalid.

In DB2 for z/OS you use the stored procedure SYSPROC.XSR_REMOVE to drop schemas. The following stored procedure call removes the schema custxsd:

CALL SYSPROC.XSR_REMOVE('SYSXSR','CUSTXSD')

As an alternative to the DROP XSROBJECT command and the XSR_REMOVE procedure, you can also use the JDBC method Connection.deregisterDB2XmlObject in your Java application.

You can drop an XML Schema even if there are XML documents in the database that have been validated against that schema. In DB2, XML Schemas are not assigned to an entire table or XML column because that would be too restrictive and would prevent schema variability within an XML column. Instead, the relationship between XML Schemas and XML instance documents is managed on a per-document basis (see Chapter 17). If you use DB2 for Linux, UNIX, and Windows, then the fact that a document has been validated against a certain schema is automatically recorded with the document itself in DB2’s XML storage. The XML document continues to carry the ID of the schema that it has been validated against, even if that schema gets dropped. In DB2 for z/OS the relationship between documents and a schema is not externalized. To make this relationship explicit, you can insert a schema identifier into an extra column of your table alongside each XML document.
DB2 has no built-in mechanism that prevents you from dropping a schema even if documents validated with this schema still exist in the database. This is a conscious DB2 design decision because such a mechanism would require extra processing for every single XML insert, delete, and update operation, and it is better to avoid this overhead. In DB2 for Linux, UNIX, and Windows you can use the query in Figure 16.17 to determine the number of documents in the customer table that have been validated against the XML Schema db2admin.custxsd. The function XMLXSROBJECTID takes the XML column name as input. It returns DB2’s internal ID number of the XML Schema that the document in the current row was
validated against, even if that schema was subsequently dropped. This schema ID joins to the catalog view SYSCAT.XSROBJECTS, where the relational SQL identifier of the schema (db2admin.custxsd) is stored. The function XMLXSROBJECTID returns zero if the document was not validated.

SELECT count(*)
FROM customer a, syscat.xsrobjects b
WHERE XMLXSROBJECTID(a.info) = b.objectid
  AND b.objectschema = 'DB2ADMIN'
  AND b.objectname = 'CUSTXSD';

Figure 16.17 Finding documents that were validated with a given XML Schema

DB2’s internal unique identification number of an XML Schema is created when you register a schema and it is stored in the column OBJECTID of the catalog view SYSCAT.XSROBJECTS. The OBJECTID cannot be changed. When an XML document is validated against an XML Schema in DB2 for Linux, UNIX, and Windows, this unique identifier (and not the XML Schema name) is stored with the XML document.

If you drop an XML Schema and then register the same or a different schema under the same name, it will be assigned a different internal identification number. The query in Figure 16.17 will then no longer return the same result. The function XMLXSROBJECTID will continue to return the internal ID of the schema that was dropped and is now missing from SYSCAT.XSROBJECTS.

This behavior makes sense for the following reason. If you drop one schema and then register a different schema under the same name, any XML documents that are valid against the dropped schema are not necessarily valid against the new schema. These documents are not automatically revalidated against the new schema. If you revalidate them explicitly, then the documents will obtain the internal object ID of the new schema and lose the OBJECTID of their previous schema. In other words, if a document is validated multiple times, it retains only the OBJECTID of the XML Schema in the most recent validation.
If you have been using a certain version of your XML Schema and you want to start using an updated version of the schema, then you should consider the UPDATE XMLSCHEMA command. It enables you to perform compatible schema evolution and is described in the next section.

16.6 XML SCHEMA EVOLUTION

One of the main reasons for using XML is its flexibility and extensibility. XML as a data format enables you to react quickly to changing business needs. When products, services, processes, or other parts of your business change, this change typically needs to be reflected in the data that you capture, store, and process. For example, you might have to allow for additional data items to be included in your XML documents. As a result, you are likely to keep enhancing the XML Schema that defines your XML data. This process leads to new versions of your XML Schema. In the following sections we describe three approaches to dealing with such XML Schema evolution.
16.6.1 Schema Evolution Without Document Validation

If you are not using your XML Schemas to validate documents in DB2, schema evolution at the database level is easy. Without validation, DB2 is schema agnostic and allows you to insert any kind of well-formed XML documents into an XML column. You don’t even have to register any XML Schemas in DB2. At any point in time your application can decide to switch the document format and insert XML documents that comply with a new version of your XML Schema. The documents can be inserted into the same XML column as before. In this scenario DB2 is unaware of the fact that the new documents belong to a different XML Schema than the documents that were inserted previously. To help your application distinguish which documents belong to which version of your XML Schema, you may decide to add an integer column to your table so you can record the version number of the schema for every document that is inserted or updated.

16.6.2 Generic Schema Evolution with Document Validation

XML Schema evolution becomes more interesting if you perform validation in DB2. If you register each version of your XML Schema in DB2 for Linux, UNIX, and Windows and use it for document validation, a situation like the one in Figure 16.18 arises. The same situation arises in DB2 for z/OS, except that the relationship between validated documents and their schemas is not externally visible. However, you can maintain a schema identifier in a separate column of your user table to make the relationship between documents and schemas explicit.

You start out by registering the initial version of your XML Schema in DB2, using the name customerV1xsd. At that time DB2 assigns an internal OBJECTID to this schema. In the example in Figure 16.18 this is the number 53521. For a while your application works with just this one XML Schema. Several XML documents are inserted into the customer table and validated against the schema customerV1xsd.
These are the documents with the id values 1, 2, and 3 in Figure 16.18. If you apply the function XMLXSROBJECTID to any of these documents, you will always obtain the value 53521. This value links the documents to the schema they were validated with. After some time your business requires you to change or extend your XML Schema. You create the second version of your XML Schema and register it as customerV2xsd in DB2. For DB2 this is an entirely different schema and a new OBJECTID is assigned (33496). From this point onwards you validate all new XML documents against the new version of your schema. In Figure 16.18, these are the documents 4 and 5. The function XMLXSROBJECTID() will reveal that these documents point back to schema customerV2xsd. Eventually, you find yourself required to make yet another schema evolution step. You register the third version of your schema as customerV3xsd and validate new documents against this latest version of your schema. Note that the introduction of new schemas (or schema versions) does not require old documents to be revalidated against the new schema. Such “bulk validation” can be done in DB2 (see Chapter 17) but it is often a very time-consuming operation and therefore usually avoided.
[Figure: table customer (columns id, info, xmlxsrobjectid) with rows 1 to 7, each row linked to one of the following entries in SYSCAT.XSROBJECTS]

OBJECTID  OBJECTSCHEMA  OBJECTNAME
--------  ------------  -------------
53521     db2admin      customerV1xsd
33496     db2admin      customerV2xsd
70472     db2admin      customerV3xsd
...

Figure 16.18 XML documents validated against different versions of a schema

In the scenario depicted in Figure 16.18, the XML Schema Repository contains the history of your XML Schema versions and each document is correctly linked to its corresponding version of your XML Schema. For additional comfort you can certainly add an integer column version to the customer table to explicitly record the schema version with each row (document).

The advantage of managing schema evolution as shown in Figure 16.18 is that no version of your XML Schema is required to be backward compatible with any previous version. The schema customerV2xsd would be backward compatible with customerV1xsd if every document that was valid for customerV1xsd is also valid for customerV2xsd. For example, if the difference between the two versions is only that customerV2xsd defines additional optional elements, then customerV2xsd is backward compatible with customerV1xsd. However, if customerV2xsd declares new mandatory elements that did not exist in customerV1xsd, then this is a non-compatible schema evolution. In this case, documents that were valid for customerV1xsd will no longer be valid for customerV2xsd. Both compatible and non-compatible schema evolution are possible with the approach described in this section.

A disadvantage of this approach can be that a number of different schema versions are explicitly distinguished and managed in the database and the application. Depending on the details of your application, this complexity may or may not be easy to handle.

16.6.3 Compatible Schema Evolution with the UPDATE XMLSCHEMA Command

It is very common that a new version of a schema is backward compatible with the previous version of the schema.
Backward compatible means that any document that is valid for the previous version is also valid for the new version of the schema. In this case, DB2 9.5 for Linux, UNIX, and Windows allows you to replace (update) the old version of the schema with the new version. After this operation, only the new version of the schema remains in DB2’s XML Schema Repository, and all documents that had been validated against the previous version of the schema now appear as if they had been validated against the new version. This schema replacement allows you to continue to work with a single XML Schema, instead of two.
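To make the notion of backward compatibility more tangible, here is a deliberately simplified Java sketch. It is not the check that DB2 performs: it looks only at a flat list of element names with a required/optional flag, and the class name CompatibilitySketch, the method violations, and the sample element names are hypothetical. Under these assumptions, the new schema is treated as incompatible if it removes an element, makes an optional element required, or introduces a new required element, because each of these changes could make a previously valid document invalid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, simplified model: a schema is a map from element name to
// "required" (true) or "optional" (false). Real schemas are far richer.
final class CompatibilitySketch {

    /** Returns a description of each change that breaks backward compatibility. */
    static List<String> violations(Map<String, Boolean> oldElems,
                                   Map<String, Boolean> newElems) {
        List<String> problems = new ArrayList<>();
        for (String name : new TreeSet<>(oldElems.keySet())) {
            if (!newElems.containsKey(name)) {
                // Old documents may still contain this element
                problems.add("removed: " + name);
            } else if (!oldElems.get(name) && newElems.get(name)) {
                // Old documents were allowed to omit this element
                problems.add("now required: " + name);
            }
        }
        for (String name : new TreeSet<>(newElems.keySet())) {
            if (!oldElems.containsKey(name) && newElems.get(name)) {
                // No old document can contain an element that did not exist
                problems.add("new required: " + name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = new HashMap<>();
        v1.put("name", true);
        v1.put("phone", false);

        Map<String, Boolean> v2 = new HashMap<>(v1);
        v2.put("email", false);   // adding an optional element

        Map<String, Boolean> v3 = new HashMap<>(v1);
        v3.put("taxid", true);    // adding a required element

        System.out.println("v1 -> v2: " + violations(v1, v2));
        System.out.println("v1 -> v3: " + violations(v1, v3));
    }
}
```

An empty result means that, within this toy model, every document that was valid for the old schema remains valid for the new one.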
Updating an old schema with a compatible new schema is a quick operation in the DB2 catalog. The existing XML documents are not revalidated, updated, examined, or touched in any way. DB2 only compares the old with the new schema to verify that they are compatible. If they are compatible, the UPDATE XMLSCHEMA command succeeds; otherwise it fails. The new schema assumes the name and the OBJECTID of the old schema, and thus seamlessly takes its place.

Now let’s look at a compatible schema evolution step by step.

1. Create a table:

create table customer(id integer, info XML);

2. Register the initial version of your XML Schema under the name custxsd:

REGISTER XMLSCHEMA 'customerV1.xsd'
  FROM 'FILE:c:\xml\myschemas\customerV1.xsd'
  AS db2admin.custxsd COMPLETE;

3. Insert any number of documents into the table and validate them against the schema custxsd. The situation in your database now looks like Figure 16.19. There is one schema listed in the XSROBJECTS catalog view, and if you apply the function XMLXSROBJECTID() to any of the validated documents it always returns the OBJECTID of that one XML Schema.

[Figure: table customer (columns id, info, xmlxsrobjectid) with rows 1 to 3, all linked to the single entry in SYSCAT.XSROBJECTS]

OBJECTID  OBJECTSCHEMA  OBJECTNAME
--------  ------------  ----------
53521     db2admin      custxsd
...

Figure 16.19 Three documents, validated with XML Schema custxsd

4. No matter how many documents are already stored in the table and validated against custxsd, at some point your application might have to start using a new but compatible version of your XML Schema, custxsd_V2. Register this schema just like you registered the previous schema:

REGISTER XMLSCHEMA 'customerV2.xsd'
  FROM 'FILE:c:\xml\myschemas\customerV2.xsd'
  AS db2admin.custxsd_V2 COMPLETE;
5. Figure 16.20 shows that the new version of your schema appears as a separate entry in the schema repository. It has a different name and OBJECTID than the first version.

[Figure: table customer (columns id, info, xmlxsrobjectid) with rows 1 to 3, shown next to SYSCAT.XSROBJECTS]

OBJECTID  OBJECTSCHEMA  OBJECTNAME
--------  ------------  ----------
53521     db2admin      custxsd
33496     db2admin      custxsd_V2
...

Figure 16.20 A second schema has been registered.

NOTE: If you want to perform compatible schema evolution, no documents should be validated against the new schema custxsd_V2 before the UPDATE XMLSCHEMA command has been issued!

6. Issue the UPDATE XMLSCHEMA command to replace the old schema with the new one:

UPDATE XMLSCHEMA db2admin.custxsd
  WITH db2admin.custxsd_V2 DROP NEW SCHEMA;

Alternatively, you can use the stored procedure XSR_UPDATE or the JDBC method Connection.updateDB2XmlSchema to achieve the same effect.

After this command, the situation in your database looks much like it did before you registered the new schema (see Figure 16.19). What happened? The content of the new version of your schema has been used to overwrite the content of the old version. This operation has temporarily produced two copies of the new schema, one under the name custxsd and one under the name custxsd_V2. However, the DROP NEW SCHEMA clause of the UPDATE XMLSCHEMA command has automatically removed the schema with the name custxsd_V2 and OBJECTID 33496. A single copy of the new XML Schema remains, registered under the name and OBJECTID of the old schema. Therefore your application can seamlessly continue to reference the old schema name (custxsd) to validate new documents against the new version of the schema. Since applications do not need to start using a different schema name, schema evolution is possible without interruption to live applications.

Note that the single remaining copy of the XML Schema now carries the schema location of the new version of the schema (customerV2.xsd).
This can be helpful if new documents use this value in their schema location attributes to reference a schema. However, when registering the new version of your schema you can specify the same schema location as for the previous version, if you prefer to keep the same value.
7. Insert additional documents and validate them with the schema named custxsd, as you did prior to the schema evolution. The old schema name custxsd now identifies the new schema. As shown in Figure 16.21, both old and new documents carry the OBJECTID of the same schema.

[Figure: table customer (columns id, info, xmlxsrobjectid) with rows 1 to 5, all linked to the single entry in SYSCAT.XSROBJECTS]

OBJECTID  OBJECTSCHEMA  OBJECTNAME
--------  ------------  ----------
53521     db2admin      custxsd
...

Figure 16.21 Documents inserted after updating the schema

If the two XML Schemas referenced in the UPDATE XMLSCHEMA command are not compatible, DB2 produces error message SQL20432N and indicates which of the ten compatibility rules in Table 16.3 has been violated. The compatibility rules ensure that the new schema is not more restrictive than the old schema. Backward compatibility implies that the new schema cannot remove any element, attribute, or type declarations that are present in the old schema. This is required so that any document that was valid for the old schema is automatically valid in the new schema.

Table 16.3 Conditions for XML Schema Compatibility

Rule | Description
-----|------------
(1) Attributes | Attributes in the old XML Schema must also be present in the new XML Schema. Also, the new XML Schema cannot contain required attributes unless they are already included in the old XML Schema. Only optional attributes may be added to the schema.
(2) Elements | Elements in the old XML Schema must also be present in the new XML Schema. Also, the new XML Schema cannot contain required elements unless they are already included in the old XML Schema. Only optional elements may be added to the schema.
(3) Simple type conflict | The value range of a simple type in the new XML Schema must be equal to or larger than the value range of the same simple type in the old XML Schema. For example, if the customer name is restricted to a string of 20 characters in the old schema and the new schema allows 30 characters, that’s a backward compatible schema change. However, if the old schema defines the “year” to be a four-digit integer value and the new schema defines “year” as a two-digit integer, then this is not backward compatible.
(4) Incompatible type | The data type of an element or attribute in the new XML Schema must be equal to or more inclusive than the data type of the same element or attribute in the old schema. For example, if an element “postal code” is defined as xs:integer in the old schema and as xs:string in the new schema, then this is compatible because every integer is a valid string.
(5) Mixed content vs. not mixed content | If the old XML Schema declares an element such that it can contain mixed content, then the same element must also be allowed to contain mixed content in the new schema. Mixed content is explained in section 3.1, Understanding XML Document Trees.
(6) Nillable vs. not nillable | If the attribute nillable in an element declaration of the original XML Schema is turned on, it must also be turned on in the new XML Schema. (Note that the attribute nillable is rarely used and should not be used as an equivalent to “nullable” columns in relational tables. Use optional elements and attributes to allow for missing values.)
(7) Removed element | Global elements declared in the old XML Schema must also be present in the new XML Schema, and must not be declared as “abstract.”
(8) Removed type | If the old XML Schema contains a global type that is derived from another type, the global type must also be present in the new XML Schema.
(9) Simple to complex | A complex type that contains simple content in the old XML Schema cannot contain complex content in the updated XML Schema.
(10) Simple content | Simple types defined in the old XML Schema and in the new XML Schema must be based on the same built-in data types.

16.7 GRANTING AND REVOKING XML SCHEMA USAGE PRIVILEGES

In DB2 for z/OS no privileges are associated with XML Schemas. After an XML Schema is registered, any user can reference that schema to validate XML documents.

In DB2 for Linux, UNIX, and Windows, the USAGE privilege for an XML Schema is automatically granted to the user who registers the schema in the XSR. If you want to allow other database users to use the XML Schema, you need to explicitly grant USAGE of the XML Schema to PUBLIC using the following command:
GRANT USAGE ON XSROBJECT db2admin.custxsd TO PUBLIC

The XML Schema to grant usage on is referenced by its SQL identifier (db2admin.custxsd) that was assigned during schema registration. Note that PUBLIC is currently the only user to whom the usage of an XSR object can be granted. If you try to grant usage to a specific user or DB2 role, the GRANT command fails with error SQL0104N.

If you do not grant USAGE of the XML Schema to PUBLIC, then a user trying to access the XML Schema for validation will receive the following error message:

SQL0551N "<user>" does not have the privilege to perform operation "Validation" on object "<object-name>". SQLSTATE=42501.

If an XML Schema consists of multiple documents, then the user who registers the primary schema document (through the XSR_REGISTER stored procedure, for example) must also be the user who adds additional XML Schema documents and completes the registration process.

You can revoke the USAGE privilege for an XML Schema with the command:

REVOKE USAGE ON XSROBJECT db2admin.custxsd FROM PUBLIC

The query in Figure 16.22 checks the usage authorizations for all XML Schemas in DB2 for Linux, UNIX, and Windows databases. The result shows that user DB2ADMIN has usage authorization for the XML Schemas with the identifiers custxsd2 and supplier because this user has registered these schemas. Additionally, the third row in the result shows that user DB2ADMIN has run the GRANT USAGE command to authorize PUBLIC to use the schema custxsd2. The catalog views XSROBJECTAUTH and XSROBJECTS are explained in detail in section 16.9.
SELECT SUBSTR(a.grantor,1,10) AS grantor,
       SUBSTR(a.grantee,1,10) AS grantee,
       SUBSTR(b.objectname,1,10) AS xmlschema,
       a.usageauth
FROM syscat.xsrobjectauth a, syscat.xsrobjects b
WHERE a.objectid = b.objectid;

GRANTOR     GRANTEE     XMLSCHEMA   USAGEAUTH
----------  ----------  ----------  ---------
SYSIBM      DB2ADMIN    CUSTXSD2    G
SYSIBM      DB2ADMIN    SUPPLIER    G
DB2ADMIN    PUBLIC      CUSTXSD2    Y

  3 record(s) selected.

Figure 16.22 Checking the usage authorization for XML Schemas
16.8 DOCUMENT TYPE DEFINITIONS (DTDS) AND EXTERNAL ENTITIES

Prior to the emergence of XML Schema as the de facto standard for constraining XML document structures, DTDs (Document Type Definitions) were frequently used to define and constrain XML data. Although DTDs are still being used in some application areas, such as publishing, XML Schemas are by far the most common choice for defining and validating XML documents. Therefore DB2 currently does not support validation of XML documents against DTDs, only validation against XML Schemas.

Almost every DTD can easily be converted into an equivalent XML Schema. You can find free and commercial tools for this conversion on the Internet. Note that DTDs have a variety of shortcomings as compared to XML Schemas, including the following:

• With DTDs you cannot define data types such as integer, decimal, date, and so on for your XML elements and attributes.
• DTDs do not allow you to define and reuse complex element types.
• DTDs do not allow you to declare one specific XML element to be the root element. Since all element definitions in a DTD are global, any defined element can be interpreted as a valid root element. As a result, a DTD can typically not force documents to have a specific root element.
• Occurrence indicators in DTDs are limited to + (one or more occurrences), ? (zero or one occurrence), and * (zero or more occurrences). DTDs do not allow you to specify, for example, that an element has to occur at least 2 and at most 5 times.
• It is practically impossible to define, manage, and validate namespaces with DTDs.
• DTDs themselves are not written in XML notation.

Although DB2 does not support DTD validation, DB2 for Linux, UNIX, and Windows allows you to register DTDs in the XML Schema Repository (XSR). If your XML documents contain references to an external DTD, then this DTD must be registered in the XSR.
The DTD can define default values for attributes or so-called entities, which can be referenced in the XML documents. To store a correct representation of such documents, DB2 accesses the DTD in the XSR to check for default attribute values and to resolve entity references as needed. DB2 for z/OS only allows internal DTDs. Internal DTDs are embedded inside an XML instance document and therefore do not need to be registered in the XSR. If an internal DTD is present, default attribute values are applied and entity references in the document are resolved. DB2 for z/OS also tolerates XML documents that contain a reference to an external DTD, but never reads or processes external DTDs. Figure 16.23 shows an XML document that references an external DTD as well as an entity &mycity; in the city element. In DB2 for Linux, UNIX, and Windows, this XML document cannot be inserted into a DB2 database unless the DTD customer.dtd is registered in DB2’s XSR.
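The effect of a DTD on entity references and default attribute values can be observed outside DB2 with the XML parser that ships with the JDK. The following sketch is only an illustration of the general XML behavior described above, not of DB2 internals: the class name InternalDtdDemo and the sample document are hypothetical, and the sketch uses an internal DTD so that it does not depend on any external file. The standard JAXP classes DocumentBuilderFactory and DocumentBuilder parse the document, after which the city element contains the replacement text of the entity and the addr element carries the default value for country.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; uses the JDK's parser, not DB2.
final class InternalDtdDemo {

    // Sample document with an internal DTD that declares an entity
    // and a default value for the attribute country.
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE customerinfo [\n"
        + "  <!ENTITY mycity \"Markham, Greater Toronto Area\">\n"
        + "  <!ATTLIST addr country CDATA \"Canada\">\n"
        + "]>\n"
        + "<customerinfo Cid=\"1004\">\n"
        + "  <addr><city>&mycity;</city></addr>\n"
        + "</customerinfo>";

    /** Parses the XML text without validation; wraps checked exceptions. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    /** Text of the first city element, with entity references resolved. */
    static String cityOf(Document doc) {
        return doc.getElementsByTagName("city").item(0).getTextContent();
    }

    /** Value of the country attribute on the first addr element. */
    static String countryOf(Document doc) {
        Element addr = (Element) doc.getElementsByTagName("addr").item(0);
        return addr.getAttribute("country");
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("city    = " + cityOf(doc));
        System.out.println("country = " + countryOf(doc));
    }
}
```

Note that the document itself never spells out the city name or the country. Both values come from the DTD, which is why a database that stores such a document needs access to the DTD in order to store a correct representation.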
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE customerinfo SYSTEM "customer.dtd">
<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>&mycity;</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M3Z 5H9</pcode-zip>
  </addr>
</customerinfo>

Figure 16.23 A document with reference to a DTD

The following REGISTER XSROBJECT command adds the DTD customer.dtd to the XSR:

REGISTER XSROBJECT 'customer.dtd'
  FROM 'file:c:/xml/DTDs/customer.dtd' AS db2admin.custdtd

Alternatively, DB2 also offers the stored procedure XSR_DTD to register DTDs from an application program via an API. For the document in Figure 16.23, the registered DTD has to contain a line with an internal entity definition for the entity mycity, such as:

<!ENTITY mycity "Markham, Greater Toronto Area">

When the XML document is inserted into an XML column, DB2 examines the referenced DTD in the XSR. This enables DB2 to replace the entity reference &mycity; with the string value "Markham, Greater Toronto Area", which is defined in the entity declaration in the DTD.

For further details on DTDs, internal entities, and external entities, please see http://www.w3schools.com/DTD and http://www.w3schools.com/dtd/dtd_entities.asp.

16.9 BROWSING THE XML SCHEMA REPOSITORY (XSR)

The XML Schema Repository (XSR) consists of several catalog tables, which allow you to query information about the XML Schemas that are registered. You cannot add, change, or remove information from the XML Schema Repository through SQL statements. All modifications are made through the XSR commands or stored procedures, which allow you to register, drop, or update XML Schemas. In DB2 for Linux, UNIX, and Windows the XSR tables, views, and stored procedures are created when a database is created.
In DB2 for z/OS the XSR tables, views, stored procedures, and functions need to be created explicitly using the installation job DSNTIJSG or the migration job DSNTIJNX. By default the XSR tables are created in DSNXSR.SYSXSR using STOGROUP SYSDEFLT. To create and use the
XSR you need to have Java 5 or later installed and a WLM environment set up. The XSR stored procedures and UDFs run in the WLM address space. The procedure XSR_COMPLETE uses Java to compile XML Schema Documents (XSDs) into a binary schema representation for efficient document validation in INSERT and UPDATE statements. In Appendix C you can find a link to the DB2 for z/OS XSR Setup and Troubleshooting Guide.

16.9.1 Tables and Views of the XML Schema Repository

DB2 for Linux, UNIX, and Windows provides catalog views in the SYSCAT schema for easier browsing of the XSR tables. DB2 for z/OS provides similar information through catalog tables in schema SYSIBM, which have the same names as the catalog views in DB2 for Linux, UNIX, and Windows. The table and view names are summarized in Table 16.4. Note that in DB2 for z/OS, BLOB columns reside in separate auxiliary tables.

Table 16.4 Tables and Views of the XML Schema Repository

DB2 for Linux, UNIX, and Windows  DB2 for Linux, UNIX, and Windows  DB2 for z/OS
Catalog Views (Schema: SYSCAT)    Catalog Tables (Schema: SYSIBM)   Catalog Table (Schema: SYSIBM)
--------------------------------  --------------------------------  ------------------------------
XSROBJECTS                        SYSXSROBJECTS                     XSROBJECTS + BLOBs in XSROBJECTGRAMMAR and XSROBJECTPROPERTY
XSROBJECTCOMPONENTS               SYSXSROBJECTCOMPONENTS            XSROBJECTCOMPONENTS + BLOBs in XSRCOMPONENT and XSRPROPERTY
XSROBJECTHIERARCHIES              SYSXSROBJECTHIERARCHIES           XSROBJECTHIERARCHIES
XSROBJECTDEP                      SYSXSROBJECTDEP                   -
XSROBJECTAUTH                     SYSXSROBJECTAUTH                  -
XDBMAPGRAPHS                      SYSXDBMAPGRAPHS                   -
XDBMAPSHREDTREES                  SYSXDBMAPSHREDTREES               -

The XSR catalog views in DB2 for Linux, UNIX, and Windows provide the following information, which is also visualized in Figure 16.24. In DB2 for z/OS, this information exists in the corresponding catalog tables.

• SYSCAT.XSROBJECTS: This is the main view, which contains one row for every XML Schema that is registered. The row holds the OBJECTID of the schema, its name, target namespace, schema location, and other meta information.
The actual schema documents (XSD files) are not in this view, but in SYSCAT.XSROBJECTCOMPONENTS.
• SYSCAT.XSROBJECTCOMPONENTS: An XML Schema can consist of one or multiple schema documents, also called components. SYSCAT.XSROBJECTCOMPONENTS therefore contains one or multiple rows for each XML Schema listed in SYSCAT.XSROBJECTS. Each row describes one schema document. SYSCAT.XSROBJECTCOMPONENTS has a column COMPONENT of type BLOB(30M), which contains the actual schema document in binary format.
• SYSCAT.XSROBJECTHIERARCHIES: This view contains information about the hierarchical relationships between an XML Schema and its components.
• SYSCAT.XSROBJECTDEP: This view lists dependencies, if any, between XML Schemas and other database objects. Currently, there is only one kind of dependency that is being recorded, and that is for XML Schemas that are enabled for decomposition (shredding). Such schemas depend on the target tables that are used for shredding.
• SYSCAT.XSROBJECTAUTH: Each row in this view represents a user or a group that has been granted the USAGE privilege on a particular XSR object, such as an XML Schema.
• SYSCAT.XDBMAPGRAPHS and SYSCAT.XDBMAPSHREDTREES: These two views contain information when an XML Schema has been annotated with mapping information that can be used to shred XML data into relational tables. The so-called annotated schema shredding or decomposition is discussed in Chapter 11.
Figure 16.24 DB2’s XML Schema Repository (diagram of the catalog views XSROBJECTS, XSROBJECTCOMPONENTS, XSROBJECTHIERARCHIES, XSROBJECTDEP, and XSROBJECTAUTH with their columns)

The following tables provide more detail on the columns in these catalog views.
Table 16.5 SYSCAT.XSROBJECTS: Each Row Represents an XML Schema Repository Object

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Unique generated identifier for an XSR object
OBJECTSCHEMA        VARCHAR(128)   Schema name of the XSR object
OBJECTNAME          VARCHAR(128)   Unqualified name of the XSR object
TARGETNAMESPACE     VARCHAR(1001)  String identifier for the target namespace
SCHEMALOCATION      VARCHAR(1001)  String identifier for the schema location, or system identifier
OBJECTINFO          XML            Optional metadata document describing the object
OBJECTTYPE          CHAR(1)        XSR object type: D = DTD, E = External Entity, S = XML Schema
OWNER               VARCHAR(128)   Authorization ID under which the XSR object was registered
OWNERTYPE           CHAR(1)        S = The owner is the system, U = The owner is an individual user
CREATE_TIME         TIMESTAMP      Time at which the object was registered
ALTER_TIME          TIMESTAMP      Time at which the object was last updated (replaced)
STATUS              CHAR(1)        Registration status: C = Complete, I = Incomplete, R = Replace, T = Temporary
DECOMPOSITION       CHAR(1)        Indicates whether decomposition (shredding) is enabled for this XML Schema: N = Not enabled, X = Inoperative, Y = Enabled
REMARKS             VARCHAR(254)   User-provided comments, or null

In DB2 for z/OS, the corresponding catalog table is called SYSIBM.XSROBJECTS and the first three columns are called XSROBJECTID, XSROBJECTSCHEMA, and XSROBJECTNAME. Its TARGETNAMESPACE column is an integer and is the value of the STRINGID column in
SYSIBM.SYSXMLSTRINGS where the target namespace URI of the primary XML Schema document is stored. Similarly, its SCHEMALOCATION column is an integer and also a pointer into SYSIBM.SYSXMLSTRINGS where the schema location URI of the primary XML Schema document is stored. See Chapter 3 for more information on SYSIBM.SYSXMLSTRINGS.

Table 16.6 SYSCAT.XSROBJECTCOMPONENTS: Each Row Represents an XSR Object Component, Such as a Schema Document

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Unique generated identifier for an XSR object
OBJECTSCHEMA        VARCHAR(128)   Schema name of the XSR object
OBJECTNAME          VARCHAR(128)   Unqualified name of the XSR object
COMPONENTID         BIGINT         Unique generated identifier for an XSR object component
TARGETNAMESPACE     VARCHAR(1001)  String identifier for the target namespace
SCHEMALOCATION      VARCHAR(1001)  String identifier for the schema location
COMPONENT           BLOB(30M)      External representation of the component. Actual schema documents are stored here
CREATE_TIME         TIMESTAMP      Time at which the XSR object component was registered
STATUS              CHAR(1)        Registration status: C = Complete, I = Incomplete

In DB2 for z/OS, the TARGETNAMESPACE and SCHEMALOCATION columns again hold integer values that point into SYSIBM.SYSXMLSTRINGS.

Table 16.7 SYSCAT.XSROBJECTHIERARCHIES: Each Row Represents the Hierarchical Relationship between an XSR Object and Its Components

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Identifier for an XSR object
COMPONENTID         BIGINT         Identifier for an XSR component
HTYPE               CHAR(1)        Hierarchy type: D = Document, N = Top-level namespace, P = Primary document
TARGETNAMESPACE     VARCHAR(1001)  Identifier for the component’s target namespace
SCHEMALOCATION      VARCHAR(1001)  Identifier for the component’s schema location
In DB2 for z/OS the corresponding catalog table is called SYSIBM.XSROBJECTHIERARCHIES and the first two columns are called XSROBJECTID and XSRCOMPONENTID. As in the XSROBJECTS table, the TARGETNAMESPACE and SCHEMALOCATION columns are integer values.

Table 16.8 SYSCAT.XSROBJECTDEP: Each Row Represents a Dependency of an XSR Object on Some Other Object

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Unique generated identifier for an XSR object
OBJECTSCHEMA        VARCHAR(128)   Schema name of the XSR object
OBJECTNAME          VARCHAR(128)   Unqualified name of the XSR object
BTYPE               CHAR(1)        Type of object on which there is a dependency, such as T if the schema depends on a table into which the schema shreds
BSCHEMA             VARCHAR(128)   Schema name of the object on which there is a dependency
BNAME               VARCHAR(128)   Unqualified name of the object on which there is a dependency. For routines (BTYPE = 'F'), this is the specific name
TABAUTH             SMALLINT       If BTYPE = 'O', 'S', 'T', 'U', 'V', 'W', or 'v', encodes the privileges on the table or view that are required by a dependent trigger; null value otherwise

Table 16.9 SYSCAT.XSROBJECTAUTH: Each Row Represents a User or Group That Has Been Granted the USAGE Privilege on a Particular XSR Object

Column Name         Data Type      Description
------------------  -------------  -----------
GRANTOR             VARCHAR(128)   Grantor of the privilege
GRANTORTYPE         CHAR(1)        S = Grantor is the system, U = Grantor is an individual user
GRANTEE             VARCHAR(128)   Holder of the privilege
GRANTEETYPE         CHAR(1)        G = Grantee is a group, R = Grantee is a role, U = Grantee is an individual user
OBJECTID            BIGINT         Identifier for the XSR object
USAGEAUTH           CHAR(1)        Privilege to use the XSR object and its components: N = Not held, Y = Held, G = Granted
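The authorization view can be joined to the main view through OBJECTID, just like the other XSR views. The following query is a minimal sketch for DB2 for Linux, UNIX, and Windows that uses only columns listed in Tables 16.5 and 16.9; the SUBSTR widths and the result column aliases are arbitrary choices for readable Command Line Processor output.

```sql
-- Sketch: list which users, groups, or roles hold the USAGE privilege
-- on each registered XSR object. Aliases and SUBSTR widths are illustrative.
SELECT SUBSTR(o.objectschema,1,10) AS rel_schema,
       SUBSTR(o.objectname,1,10)   AS identifier,
       SUBSTR(a.grantee,1,12)      AS grantee,
       a.granteetype,
       a.usageauth
FROM syscat.xsrobjectauth a,
     syscat.xsrobjects o
WHERE a.objectid = o.objectid
ORDER BY o.objectname, a.grantee;
```

The GRANTEETYPE column in the result distinguishes groups (G), roles (R), and individual users (U), as described in Table 16.9.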
Table 16.10 SYSCAT.XDBMAPGRAPHS: Each Row Represents a Schema Graph for an XDB Map

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Unique generated identifier for an XSR object
OBJECTSCHEMA        VARCHAR(128)   Schema name of the XSR object
OBJECTNAME          VARCHAR(128)   Unqualified name of the XSR object
SCHEMAGRAPHID       INTEGER        Schema graph identifier, which is unique within an XDB map identifier
NAMESPACE           VARCHAR(1001)  Identifier for the namespace URI of the root element
ROOTELEMENT         VARCHAR(1001)  Identifier for the element name of the root element

Table 16.11 SYSCAT.XDBMAPSHREDTREES: Each Row Represents a Shred Tree for a Particular Schema Graph

Column Name         Data Type      Description
------------------  -------------  -----------
OBJECTID            BIGINT         Unique generated identifier for an XSR object
OBJECTSCHEMA        VARCHAR(128)   Schema name of the XSR object
OBJECTNAME          VARCHAR(128)   Unqualified name of the XSR object
SCHEMAGRAPHID       INTEGER        Schema graph identifier, which is unique within an XDB map identifier
SHREDTREEID         INTEGER        Shred tree identifier, which is unique within an XDB map identifier
MAPPINGDESCRIPTION  CLOB(1M)       Diagnostic mapping information

16.9.2 Queries against the XML Schema Repository

You can use regular SQL to query the tables or views of the XML Schema Repository. This allows you to retrieve any type of information about the XML Schemas and their schema documents in the repository. We provide a few examples of useful XSR queries in this section, based on the XSR catalog views in DB2 for Linux, UNIX, and Windows. You can run these queries also in DB2 for z/OS if you change the table names (and in some cases column names) to their corresponding equivalents in DB2 for z/OS. The examples also include sample output that the queries produce after registering the schemas in section 16.4.4, where two XML Schemas share a common schema document.
The query in Figure 16.25 lists all XML Schemas in the XSR, showing the relational schema they belong to, their SQL identifier, their target namespace, and status. The STATUS column tells you that the registration of both XML Schemas is complete. The SUBSTR function is used to limit the column width in the output of the DB2 Command Line Processor.

SELECT SUBSTR(objectschema,1,10) AS rel_schema,
       SUBSTR(objectname,1,10) AS identifier,
       SUBSTR(targetnamespace,1,35) AS tgt_namespace,
       status
FROM syscat.xsrobjects;

REL_SCHEMA IDENTIFIER TGT_NAMESPACE                       STATUS
---------- ---------- ----------------------------------- ------
DB2ADMIN   CUSTXSD2   http://pureXMLcookbook.org          C
DB2ADMIN   SUPPLIER   http://pureXMLcookbook.org/supplier C

  2 record(s) selected.

Figure 16.25 Listing all XML Schemas in the XML Schema Repository

The query in Figure 16.26 can be used to list all XML Schemas and their internal OBJECTID. This OBJECTID is stored with every XML document that is validated against the respective XML Schema.

SELECT SUBSTR(objectschema,1,10) AS rel_schema,
       SUBSTR(objectname,1,10) AS xml_schema,
       objectid
FROM syscat.xsrobjects;

REL_SCHEMA XML_SCHEMA OBJECTID
---------- ---------- --------------------
DB2ADMIN   SUPPLIER      39969446692943104
DB2ADMIN   CUSTXSD2      38562071809389824

  2 record(s) selected.

Figure 16.26 Listing XML Schemas and their OBJECTIDs

Since an XML Schema can consist of multiple schema documents (components), the query in Figure 16.27 is useful to list all of these components with their schema location and the overall XML Schema that they belong to. You also see the internal COMPONENTID that DB2 has assigned to each schema document. The schema document addr.xsd has been registered twice, once for the customer schema and once for the supplier schema. The two copies of addr.xsd are distinguished by different COMPONENTIDs.
SELECT SUBSTR(objectname,1,10) AS schema,
       componentid,
    -- SUBSTR(targetnamespace,8,35) AS tgt_namespace,
       SUBSTR(schemalocation, 1,35) AS schema_location
    -- , create_time
FROM syscat.xsrobjectcomponents
ORDER BY create_time;

SCHEMA     COMPONENTID       SCHEMA_LOCATION
---------- ----------------- -----------------------------------
CUSTXSD2   38843546786100480 customer2.xsd
CUSTXSD2   39125021762811136 phone.xsd
CUSTXSD2   39406496739521792 addr.xsd
SUPPLIER   40250921669653760 supplier.xsd
SUPPLIER   40532396646364416 addr.xsd

  5 record(s) selected.

Figure 16.27 Listing all schema documents (components)

The query in Figure 16.28 provides similar information, but also reveals exactly which components are primary schema documents. Those are flagged with a P in the HTYPE column and are at the top of the import/include hierarchy of an XML Schema.

SELECT SUBSTR(o.objectname,1,25) AS schema,
       h.htype,
       SUBSTR(h.schemalocation,1,35) AS schema_location
FROM syscat.xsrobjecthierarchies h,
     syscat.xsrobjects o
WHERE h.objectid = o.objectID;

SCHEMA                    HTYPE SCHEMA_LOCATION
------------------------- ----- --------------------------------
CUSTXSD                   P     customer2.xsd
CUSTXSD                   D     phone.xsd
CUSTXSD                   D     addr.xsd
SUPPXSD                   P     supplier.xsd
SUPPXSD                   D     addr.xsd

  5 record(s) selected.

Figure 16.28 Listing primary schema documents

16.10 XML SCHEMA CONSIDERATIONS IN DB2 FOR Z/OS

Where XML Schema management differs between DB2 for z/OS and DB2 for Linux, UNIX, and Windows, we have explicitly mentioned the respective platform throughout this chapter. Let’s summarize the main platform similarities and differences:
• Both DB2 for z/OS and DB2 for Linux, UNIX, and Windows have an XML Schema Repository that allows you to register XML Schemas. Schemas that consist of multiple schema documents are also supported. The schemas can then be used for document validation.
• Neither DB2 for z/OS nor DB2 for Linux, UNIX, and Windows supports validation with document type definitions (DTDs). We recommend that you use XML Schemas instead.
• The XML Schema Repository in DB2 for Linux, UNIX, and Windows consists of catalog tables plus catalog views. In general, DB2 for z/OS does not use catalog views, but the same information is available in catalog tables. See Table 16.4 in section 16.9.1.
• DB2 for z/OS keeps the object ID and name of an XML Schema in the columns XSROBJECTID and XSROBJECTNAME of the catalog table SYSIBM.XSROBJECTS. DB2 for Linux, UNIX, and Windows keeps them in the columns OBJECTID and OBJECTNAME of the corresponding catalog view SYSCAT.XSROBJECTS.
• When you register an XML Schema in DB2 for Linux, UNIX, and Windows you can optionally assign it to a relational schema of your choice. In DB2 for z/OS, this relational schema has to be SYSXSR.
• DB2 for z/OS provides a stored procedure XSR_REMOVE to drop XML Schemas. In DB2 for Linux, UNIX, and Windows you use the DROP XSROBJECT command.
• The function XMLXSROBJECTID is only supported in DB2 for Linux, UNIX, and Windows. It takes an XML document as input, such as an XML column name, and returns the OBJECTID of the XML Schema that the document was validated against. If you need similar capabilities in DB2 for z/OS you should maintain a schema identifier in a separate column of your user table.
• At the time of writing, compatible schema evolution with the UPDATE XMLSCHEMA command or the XSR_UPDATE procedure is only available in DB2 for Linux, UNIX, and Windows.
• Using XML Schemas in DB2 9 for z/OS requires Java JDK 1.5 or above installed with DB2 as well as the WLM environment setup for C stored procedures and Java stored procedures.
• If you are installing DB2 9 for z/OS, installation job DSNTIJSG creates the tables and stored procedures that support XML Schema management. This job is part of the DB2 installation process. If you are migrating to DB2 9, run job DSNTIJNX after DB2 is in new-function mode to create the tables and stored procedures for XML Schema support. These objects must exist before the JDBC method registerDB2XMLSchema can be used.
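Two of the differences in the list above can be sketched in SQL. The first statement is a hedged illustration of XMLXSROBJECTID in DB2 for Linux, UNIX, and Windows; it assumes a table customer(id INTEGER, info XML) whose documents may or may not have been validated. The second statement shows the same kind of catalog lookup in DB2 for z/OS using the XSR-prefixed column names mentioned in section 16.9.1. The column aliases and SUBSTR widths are illustrative choices.

```sql
-- (1) DB2 for Linux, UNIX, and Windows only: report, for each stored
--     document, the name of the XML Schema it was validated against.
--     Documents without a matching schema show a null xml_schema.
SELECT c.id,
       SUBSTR(x.objectname,1,10) AS xml_schema
FROM customer c
     LEFT OUTER JOIN syscat.xsrobjects x
     ON XMLXSROBJECTID(c.info) = x.objectid;

-- (2) DB2 for z/OS: list registered XML Schemas from the catalog table,
--     using the XSR-prefixed column names.
SELECT xsrobjectid,
       SUBSTR(xsrobjectschema,1,10) AS rel_schema,
       SUBSTR(xsrobjectname,1,10)   AS identifier
FROM sysibm.xsrobjects;
```

The outer join in the first statement is a design choice: it keeps rows for documents that were inserted without validation, so that they remain visible in the result.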
16.11 SUMMARY

XML Schemas provide rich capabilities to constrain XML documents. For example, an XML Schema can specify which elements and attributes are allowed to appear in a document, the order and nesting in which they may appear, the data types for their values, or that some elements are optional while others are mandatory. A process called validation can check whether an XML document complies with a given XML Schema.

The use of XML Schemas in DB2 is optional. The decision to use XML Schemas depends on your application requirements to verify inserted or updated documents with an XML Schema. If you receive XML documents from a trusted source you may decide to avoid validation in DB2 and save the extra CPU cost. A trusted source can be, for example, an application server that already validates incoming XML documents so that additional validation in DB2 might not be necessary.

If you decide to enforce XML data quality at the database level, you need to register the required XML Schema(s) in DB2’s XML Schema Repository (XSR). The XSR is a set of tables in DB2 for the storage and management of XML Schemas. Schema registration can be performed with DB2 Command Line Processor commands, stored procedure calls, or from a Java application with specific JDBC methods.

Products, services, and processes tend to change over time in most enterprises. Often, such changes need to be reflected in the data that is captured and processed. Such changes lead to schema evolution. DB2 allows you to migrate from one version of an XML Schema to the next without any downtime, no matter how different the schemas are. High schema flexibility is one of the big advantages of XML data over relational data.

The next chapter describes how registered XML Schemas can be used to validate XML documents in insert, update, or load operations and how you can find the XML Schema for a given document and vice versa.
CHAPTER 17 Validating XML Documents against XML Schemas

In Chapter 16, Managing XML Schemas, you learned how to register XML Schemas in DB2’s XML Schema Repository (XSR). In this chapter we explain the validation of XML documents using these XML Schemas. Remember that document validation is optional in DB2 and that there is no penalty in terms of DB2 performance or functionality if you don’t use an XML Schema. Document validation is also called schema validation because an XML Schema is used to validate XML documents.

DB2 offers a variety of features to manage the validation of XML documents. You can

• Validate individual documents when you insert or update them (sections 17.1 and 17.2)
• Validate documents without rejecting invalid documents (section 17.3)
• Define check constraints to make document validation mandatory (section 17.4)
• Use triggers to automatically validate every document that is inserted or updated (section 17.5)
• Get detailed information on parsing and validation errors (section 17.6)
• Validate batches of documents when you load or import them (section 17.7)
• Check and validate existing XML documents in your database (sections 17.8 and 17.9)
• Find the XML Schema for a given XML document, or find all documents for a given XML Schema (section 17.10)
• Undo document validation, to disassociate an XML document from an XML Schema (section 17.11)

Additionally, section 17.12 highlights some of the specific considerations for document validation in DB2 for z/OS.
17.1 DOCUMENT VALIDATION UPON INSERT

In DB2 for Linux, UNIX, and Windows you can use the SQL/XML function XMLVALIDATE to validate an XML document when you insert or update it. In DB2 for z/OS, the validation function is called DSN_XMLVALIDATE. The parameters and usage for DSN_XMLVALIDATE are similar to XMLVALIDATE, and any differences are described in section 17.12. Let’s look at some examples using the same table as in the previous chapter:

CREATE TABLE customer(id INTEGER, info XML)

Figure 17.1 shows several INSERT statements as you would use them in an application program. The question marks denote parameter markers to which the application binds data values that are used in an INSERT statement.

The first INSERT statement does not perform validation. It simply contains two parameter markers to insert one value into each of the two columns in the table. When this INSERT is executed, the value of the second parameter has to be a well-formed XML document.

The second INSERT statement wraps the XMLVALIDATE function around the parameter marker for the XML column. This tells DB2 to validate the XML document as part of the insert operation. The clause ACCORDING TO XMLSCHEMA ID specifies the identifier of the XML Schema that should be used for validation. This is the identifier that you provided when you initially registered the XML Schema. It allows you to force validation against this particular XML Schema. If the inserted XML document is valid with respect to this schema, the INSERT succeeds, otherwise the INSERT fails with an error code that indicates why the document is not valid.

The third INSERT statement works much like the second one, except that the schema identifier is not hard-coded but provided by the application through another parameter marker.

The fourth INSERT statement identifies the XML Schema by the URI of its target namespace instead of its relational identifier.
This approach can sometimes be more intuitive if the application usually identifies XML Schemas by URI rather than by DB2-specific identifiers. However, the URI must uniquely identify an XML Schema in the XML Schema Repository; otherwise, the INSERT fails with error SQL20335N. If the target namespace is not unique among your registered schemas, but the combination of namespace and schema location is, then you can reference the schema as in the fifth INSERT statement.
If an XML Schema has no target namespace then you can also reference it just by its schema location that you provided when you registered the Schema. In this case you use the keywords NO NAMESPACE LOCATION in the XMLVALIDATE function. These keywords are used in the sixth INSERT statement in Figure 17.1 and require the schema location to be unique in the XSR.

-- (1) insert without validation:
INSERT INTO customer(id, info) VALUES (?,?);

-- (2) insert with validation, using the schema's SQL identifier:
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA ID db2admin.custxsd) );

-- (3) obtaining the schema identifier as a parameter
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA ID ?) );

-- (4) referencing the schema by its target namespace:
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA
           URI 'http://pureXMLcookbook.org') );

-- (5) referencing the schema by its namespace and schema location
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA
           URI 'http://pureXMLcookbook.org'
           LOCATION 'customer.xsd') );

-- (6) referencing a schema without a target namespace
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA
           NO NAMESPACE LOCATION 'customer.xsd') );

-- (7) relying on schemaLocation hints in the XML documents:
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(?) );

Figure 17.1 Insert statements with and without schema validation

The seventh and last INSERT statement in Figure 17.1 uses the XMLVALIDATE function without specifying any particular XML Schema. In this case, DB2 looks at the incoming XML document to determine the XML Schema that should be used for validation. In particular, DB2 looks for a schemaLocation attribute whose value must be a pair of two URIs.
The first URI is the target namespace of the XML Schema, and the second URI is the schema location that you specified when you registered the schema. The XML document in Figure 17.2 has such a schemaLocation attribute. The target namespace is originally declared in the XML Schema, such as the one in Figure 16.1. The location URI was given to the XML Schema during registration in Figure 16.9. (See also section 15.1, Introduction to XML Namespaces, for more information on URIs.)
<customerinfo xmlns="http://pureXMLcookbook.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pureXMLcookbook.org customer.xsd"
    Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M3Z 5H9</pcode-zip>
  </addr>
  <phone type="work">905-555-4789</phone>
</customerinfo>

Figure 17.2 An XML document with a schemaLocation attribute

Considering the options in Figure 17.1, there is no single best way to identify an XML Schema. People with a strong relational database background often prefer to reference a schema by its SQL identifier. This identifier is the OBJECTNAME in the catalog view syscat.xsrobjects. XML-oriented people might prefer to reference a schema by its target namespace, schema location, or both. Either way, the effect is the same. Figure 17.3 shows how to look up a schema’s target namespace and schema location in the XML Schema Repository. In DB2 for z/OS this query would use the table sysibm.xsrobjects.

SELECT SUBSTR(objectschema,1,10) AS rel_schema,
       SUBSTR(objectname,1,10) AS name,
       SUBSTR(schemalocation,1,15) AS schemalocation,
       SUBSTR(targetnamespace,1,30) AS tgt_namespace
FROM syscat.xsrobjects
WHERE objectname = 'CUSTXSD';

REL_SCHEMA NAME       SCHEMALOCATION  TGT_NAMESPACE
---------- ---------- --------------- --------------------------
DB2ADMIN   CUSTXSD    customer.xsd    http://pureXMLcookbook.org

  1 record(s) selected.

Figure 17.3 Retrieving schema information for a given schema name

If you want to test document validation with INSERT statements in the DB2 Command Line Processor (CLP), you can include a literal XML document instead of a parameter marker (see Figure 17.4). This method works for any of the types of INSERT statements in Figure 17.1. The XMLPARSE function converts the textual XML document, which is a string, into the XML data type.
This data type conversion is required because the XMLVALIDATE function expects a value of type XML as input. If a parameter marker is used instead of a literal string value, then this type conversion happens automatically, does not require the XMLPARSE function, and is hence also known as implicit parsing.
INSERT INTO customer(id, info) VALUES (1007,
  XMLVALIDATE(XMLPARSE (document
    '<?xml version="1.0" encoding="UTF-8" ?>
     <customerinfo Cid="1007">
       <name>Kathy Smith</name>
       <addr country="Canada"><street>5 Rosewood</street>
         <city>Toronto</city>
         <prov-state>Ontario</prov-state>
         <pcode-zip>M6W 1E6</pcode-zip>
       </addr>
       <phone type="work">416-555-1358</phone>
     </customerinfo>')
  ACCORDING TO XMLSCHEMA ID db2admin.custxsd));

Figure 17.4 Validating a literal document against an XML Schema

NOTE: Validation always implies that boundary whitespace is stripped, not preserved, in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. You cannot insert, update, or load documents with schema validation and preserve whitespace at the same time. If an XML Schema defines default values for elements or attributes, these values get inserted into the document during validation.

When you insert a document with schema validation and the document does not comply with the schema, DB2 returns an error code. The INSERT statement fails and the document is rejected. For example, if an element name is misspelled (such as citys rather than city) or has the wrong case (such as City rather than city) and DB2 cannot find a corresponding element definition in the XML Schema, it issues the following error:

SQL16196N XML document contains an element "citys" that is not correctly specified. Reason code = "37" SQLSTATE=2200M.

There are dozens of different error codes for the various types of schema violations, and all of them belong to SQLSTATE 2200M. See the DB2 Information Center for all error codes.

In general, applications are in control of the INSERT statements that they send to the database server. They can choose whether to use the XMLVALIDATE function and which XML Schema to reference for validation. Document validation with the XMLVALIDATE function in INSERT statements is a decentralized and application-driven approach to ensuring data quality.
This approach offers a lot of flexibility and can be useful in dynamic and heterogeneous application environments. In other situations more centralized control is desirable. One way to achieve a more centralized control is to allow inserting and updating of XML documents only through stored procedures that are centrally defined at the database server. Alternatively, you can define check constraints and triggers, which are discussed in sections 17.4 and 17.5, respectively.
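To see the rejection behavior described in this section for yourself, you could run a deliberately invalid variant of Figure 17.4 in the Command Line Processor. The statement below is a sketch under the assumption that the schema db2admin.custxsd is registered as in the earlier examples; the id 1008 and the customer data are made up for illustration.

```sql
-- Sketch: the element name "citys" is misspelled on purpose, so the
-- document does not comply with the schema. Id and data are hypothetical.
INSERT INTO customer(id, info) VALUES (1008,
  XMLVALIDATE(XMLPARSE (document
    '<customerinfo Cid="1008">
       <name>Test Customer</name>
       <addr country="Canada">
         <street>1 Example Road</street>
         <citys>Toronto</citys>
         <prov-state>Ontario</prov-state>
         <pcode-zip>M6W 1E6</pcode-zip>
       </addr>
       <phone type="work">416-555-0100</phone>
     </customerinfo>')
  ACCORDING TO XMLSCHEMA ID db2admin.custxsd));
-- Based on the description above, this INSERT is expected to fail with
-- SQL16196N (SQLSTATE 2200M), and no row is added to the customer table.
```

Running the same statement without the XMLVALIDATE wrapper should succeed, because the document is still well-formed; this contrast shows that validation happens only where the application asks for it.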
17.2 DOCUMENT VALIDATION UPON UPDATE

When an XML document is modified there is a chance that it is no longer valid for a given XML Schema. If you don’t trust applications to modify documents in a valid manner, you may want to perform validation each time a document is updated. Even if a document has been validated upon INSERT, it does not get automatically revalidated upon UPDATE. The explicit use of the XMLVALIDATE function in UPDATE statements is required unless a validation trigger is defined.

Figure 17.5 shows four UPDATE statements with parameter markers. The first statement performs a full-document replacement without validation. Even if the original document was validated and associated with an XML Schema in the XML Schema Repository, the new document is not associated with any XML Schema after this UPDATE statement is executed.

The second and third UPDATE statements also perform full-document replacement but they use the XMLVALIDATE function around the parameter marker to enforce validation of the new document. This is analogous to the use of XMLVALIDATE in Figure 17.1.

The fourth UPDATE statement uses the XMLQUERY function and an XQuery Update expression to only change the value of the element pcode-zip. The XQuery Update expression takes the existing document as input ($INFO) and produces a modified document. The XMLVALIDATE function ensures that the modified document is valid against an XML Schema whose identifier is provided through a parameter marker.

UPDATE customer
SET info = ?
WHERE id = 1003;

UPDATE customer
SET info = XMLVALIDATE(? ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
WHERE id = 1003;

UPDATE customer
SET info = XMLVALIDATE(? ACCORDING TO XMLSCHEMA URI ?)
WHERE id = 1003;

UPDATE customer
SET info = XMLVALIDATE(
      XMLQUERY('copy $new := $INFO
                modify do replace value of
                   $new/*:customerinfo/*:addr/*:pcode-zip with "95123"
                return $new')
      ACCORDING TO XMLSCHEMA ID ?)
WHERE id = 1003;

Figure 17.5 Update statements with and without XML Schema validation
Similarly, Figure 17.6 shows an UPDATE statement that replaces an existing document with a new document that is provided as a literal value rather than through a parameter marker.

    UPDATE customer
    SET info = XMLVALIDATE(XMLPARSE(DOCUMENT
      '<customerinfo Cid="1007">
         <name>Kathy Smith</name>
         <addr country="Canada">
           <street>5 New Street</street>
           <city>Toronto</city>
           <prov-state>Ontario</prov-state>
           <pcode-zip>M6W 1E6</pcode-zip>
         </addr>
         <phone type="work">416-555-1358</phone>
       </customerinfo>')
      ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
    WHERE id = 1007;

Figure 17.6 XML update with a literal document and validation

17.3 VALIDATION WITHOUT REJECTING INVALID DOCUMENTS

You might have an application where rejecting invalid XML documents with an error is not the appropriate course of action. In this case you may wish to store the invalid document anyway so that it doesn’t get lost. At the same time you can record the fact that the document is invalid together with the reason for the schema violation. This can be done if the INSERT statement is encapsulated in a stored procedure with exception handling.

Figure 17.7 creates such a stored procedure along with an exception table customer_invalid that has a status column to hold error information for invalid documents. This stored procedure takes the same input parameters as the INSERT statements in the previous section: an id number and an XML document. Optionally it could take a schema identifier as a third parameter, but in this example we use the fixed schema db2admin.custxsd.

The actual body of the procedure consists merely of the INSERT statement at its end. This statement tries to insert the row into the customer table, using the XMLVALIDATE function for validation of the XML document. If this succeeds, the procedure exits and no further action is taken.
If the document is invalid for the specified XML Schema, this INSERT statement fails and the declared exception handler kicks in. The exception handler is an EXIT handler, which means that in case of a validation error (SQLSTATE 2200M), the statements in the BEGIN-END block of the handler are executed before the procedure exits. The handler first uses the GET DIAGNOSTICS statement to obtain the error message for the failed INSERT statement. Then it inserts the id value and the XML document without validation into the table customer_invalid, and places the error message into the status column of the same row. This method allows you to retain the invalid documents and re-examine them at a later time. They must be well-formed, but don’t have to be valid.
    CREATE TABLE customer_invalid(id INTEGER, info XML, status VARCHAR(300))#

    CREATE PROCEDURE myinsert(IN id INTEGER, IN doc XML)
    LANGUAGE SQL
    BEGIN
      DECLARE errormsg VARCHAR(300);
      DECLARE errortoken VARCHAR(50);
      DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
      DECLARE EXIT HANDLER FOR INVALID_DOCUMENT
        BEGIN
          GET DIAGNOSTICS EXCEPTION 1
            errortoken = DB2_TOKEN_STRING,
            errormsg = MESSAGE_TEXT;
          INSERT INTO customer_invalid(id, info, status)
            VALUES(id, doc, errormsg);
        END;
      INSERT INTO customer(id, info)
        VALUES(id, XMLVALIDATE(doc ACCORDING TO XMLSCHEMA ID db2admin.custxsd));
    END #

Figure 17.7 Stored procedure to handle and record validation errors

Since the body of a stored procedure can contain multiple statements, these statements have to be separated by the semicolon character. Therefore, the CLP cannot use the semicolon as the terminating character for the CREATE PROCEDURE statement. In this example we have chosen the # as the terminating character. If the procedure definition shown in Figure 17.7 is in a file create_insert_proc.sql then the following command issued at the OS prompt creates the procedure:

    db2 -td# -f create_insert_proc.sql

The option -td# tells the CLP that the # sign is used as the terminating character.

17.4 ENFORCING VALIDATION WITH CHECK CONSTRAINTS

The explicit use of the XMLVALIDATE function in INSERT and UPDATE statements leaves document validation under the control of the individual applications. Applications can choose whether to perform validation and which of the XML Schemas in the XSR to reference. One way to restrict this flexibility and to enforce validation is through the use of check constraints. Another is the use of triggers, discussed in the next section.
In DB2 for Linux, UNIX, and Windows you can use check constraints that force applications to validate XML documents during insert and update, and to reference specific schemas for validation. Such constraints prevent non-validated documents from entering an XML column. DB2 for z/OS currently does not support this feature.

The ALTER TABLE statement in Figure 17.8 defines a simple check constraint with the IS VALIDATED predicate. The constraint, which is named val_customer, is defined on the XML column info of the customer table, and requires all XML documents in this column to be validated. This constraint means that applications cannot perform INSERT or UPDATE operations on the XML column without using the XMLVALIDATE function. However, applications still have the freedom to choose any registered XML Schema for validation.

    ALTER TABLE customer
      ADD CONSTRAINT val_customer CHECK (info IS VALIDATED)

Figure 17.8 Adding a constraint to enforce validation against any XML Schema

A check constraint itself does not perform validation. It only rejects INSERT or UPDATE statements that do not perform validation. Validation needs to be performed with the XMLVALIDATE function in INSERT and UPDATE statements as explained in the previous sections. The check constraint ensures that validation is no longer optional for the info column. If an application tries to insert or update an XML document without using XMLVALIDATE, DB2 returns the following error message:

    SQL0545N The requested operation is not allowed because a row does
    not satisfy the check constraint "CUSTOMER.VAL_CUSTOMER".
    SQLSTATE=23513

You cannot add the check constraint in Figure 17.8 if the info column of the customer table already contains documents that have not been validated. In this case, the ALTER TABLE statement fails with the following message:

    SQL0544N The check constraint "VAL_CUSTOMER" cannot be added because
    the table contains a row that violates the constraint.
    SQLSTATE=23512

The query in Figure 17.9 identifies the documents in the info column that have not been validated. Section 17.9 explains how to validate or revalidate documents that are already in the table.

    SELECT id FROM customer WHERE info IS NOT VALIDATED

Figure 17.9 Finding non-validated documents
To constrain an XML column even more, you might want to enforce document validation not just against any XML Schema, but against a specific XML Schema. This has been possible since DB2 9.5 for Linux, UNIX, and Windows, where the IS VALIDATED predicate can have an optional ACCORDING TO XMLSCHEMA clause (Figure 17.10). This clause allows you to specify a particular registered XML Schema either by its identifier or URI.

    ALTER TABLE customer
      ADD CONSTRAINT val_customer
      CHECK (info IS VALIDATED
             ACCORDING TO XMLSCHEMA ID db2admin.custxsd)

Figure 17.10 A constraint to enforce validation against a specific XML Schema

The constraint in Figure 17.10 forces all applications to always validate XML documents in INSERT and UPDATE statements against the schema db2admin.custxsd. The ALTER TABLE statement fails if there are already documents in the column that haven’t been validated against this specific schema.

Forcing all applications to use the same XML Schema, and the same version of that XML Schema, can sometimes be too restrictive. There may be cases when you want to store XML documents for several different XML Schemas in one XML column. In this case you can create a check constraint with a list of allowed schemas. For example, the constraint in Figure 17.11 allows applications to insert and update XML documents as long as the XMLVALIDATE function is used to validate them against any of the three schemas listed in the constraint definition.

    ALTER TABLE customer
      ADD CONSTRAINT val_customer
      CHECK (info IS VALIDATED
             ACCORDING TO XMLSCHEMA IN (ID db2admin.custxsd,
                                        ID db2admin.custxsd_V2,
                                        ID db2admin.custxsd_V3) )

Figure 17.11 A constraint to enforce validation against one of several schemas

You cannot alter a constraint. If you want to change it you have to drop it and re-create it.
To drop a constraint, use the DROP CONSTRAINT option of the ALTER TABLE statement:

    ALTER TABLE customer DROP CONSTRAINT val_customer

Check constraints are evaluated after the INSERT/UPDATE statement is processed, and after any triggers have fired. If a check constraint is violated, the current transaction is rolled back and an error is raised.

Check constraints can be very useful, but they do not relieve the application from using the XMLVALIDATE function in INSERT and UPDATE operations. This relief can be achieved with triggers, which are discussed in the next section.
When you examine an existing database, you may want to find out whether any such constraints are defined. This is easily done by querying the SYSCAT.CHECKS catalog view, as shown in Figure 17.12.

    SELECT SUBSTR(constname,1,15) AS constraint_name,
           SUBSTR(tabname,1,10) AS tabname,
           SUBSTR(text,1,90) AS text
    FROM syscat.checks;

    CONSTRAINT_NAME TABNAME    TEXT
    --------------- ---------- -----------------------------------
    VAL_CUSTOMER    CUSTOMER   INFO IS VALIDATED ACCORDING TO
                               XMLSCHEMA IN (ID DB2ADMIN.CUSTXSD)

Figure 17.12 Listing check constraints

17.5 AUTOMATIC VALIDATION WITH TRIGGERS

Triggers allow you to perform automatic document validation in DB2 even if applications do not use the XMLVALIDATE function in INSERT or UPDATE statements. Specifically, you can create BEFORE triggers that inject the XMLVALIDATE function any time an INSERT or UPDATE statement is processed.

Figure 17.13 shows the definition of a BEFORE INSERT trigger on the customer table. This trigger fires for every INSERT statement on the customer table. The third line of the trigger definition declares that the new row that is being inserted can be referenced as a variable called newrow. In general, the body of a trigger is a block of one or multiple statements between the keywords BEGIN and END. The trigger in Figure 17.13 uses only a single SET statement. It replaces the value of the XML column info in the new row with the same value validated by the XMLVALIDATE function. In other words, the SET statement ensures that a new XML document is not inserted as is, but goes through the XMLVALIDATE function first. If the new document is valid against the schema db2admin.custxsd, the INSERT succeeds. Otherwise it fails with an error message detailing the error.

    CREATE TRIGGER validate_customer_ins
    BEFORE INSERT ON customer
    REFERENCING NEW AS newrow
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      SET newrow.info = XMLVALIDATE(newrow.info
          ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
    END #

Figure 17.13 A BEFORE trigger for document validation upon insert
Similar to the stored procedure in Figure 17.7, the body of a trigger can contain multiple statements, which have to be terminated with the semicolon character. Therefore you have to use a different character to terminate the CREATE TRIGGER statement itself, such as the # used in this example.

The trigger in Figure 17.13 ensures automatic validation only for new document inserts. You typically require a second similar trigger that automates validation when documents are updated. An example of an update trigger is shown in Figure 17.14.

The trigger in Figure 17.13 applies the XMLVALIDATE function regardless of whether the original INSERT statement that fired the trigger contains an XMLVALIDATE function or not. This trigger always overrides any validation that the application may have specified in its INSERT and UPDATE statements. This behavior can be desirable for maximum control at the database level.

If you want to allow for more flexibility you can define a conditional trigger with a WHEN clause (see Figure 17.14). This clause ensures that the trigger is executed only if the new XML document is not being validated with an XMLVALIDATE function in the original INSERT or UPDATE statement. Simply put, this trigger validates whenever the application does not specify validation, but it never overrides the validation that the application may have specified.

For example, you can use such a conditional trigger to allow applications to use any registered XML Schema for validation. At the same time the trigger ensures that documents are validated with a default schema if the application does not specify any validation in its INSERT and UPDATE statements. Additionally you can restrict an application’s choice of schema with a check constraint.
    CREATE TRIGGER validate_customer_upd
    BEFORE UPDATE OF info ON customer
    REFERENCING NEW AS newrow
    FOR EACH ROW MODE DB2SQL
    WHEN (newrow.info IS NOT VALIDATED)
    BEGIN ATOMIC
      SET newrow.info = XMLVALIDATE(newrow.info
          ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
    END #

Figure 17.14 A conditional UPDATE trigger for validation

If you want to enforce that a document is valid for multiple schemas you can have multiple SET statements in the trigger, such as in Figure 17.15. Upon insert of a document, this trigger performs validation first with the schema custxsd, then with the schema custxsd_V2. The insert succeeds only if both validations are successful, in which case the validated and stored document will reference the schema it was last validated with (custxsd_V2).

    CREATE TRIGGER insert_customer_ins2
    BEFORE INSERT ON customer
    REFERENCING NEW AS newrow
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      SET newrow.info = XMLVALIDATE(newrow.info
          ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
      SET newrow.info = XMLVALIDATE(newrow.info
          ACCORDING TO XMLSCHEMA ID db2admin.custxsd_V2);
    END #

Figure 17.15 Triggered validation with multiple schemas

NOTE Beware that validation against multiple XML Schemas can significantly increase the CPU consumption. Try to combine the multiple schemas into a single schema for more efficient validation.

17.6 DIAGNOSING VALIDATION AND PARSING ERRORS

If a document fails to validate against an XML Schema, DB2 produces an appropriate error message that might, depending on the message, contain a reason code that indicates the cause of the schema violation. If you insert or update a document without validation, errors are also raised if the document is not well-formed. Depending on the size and complexity of your documents or XML Schema it can sometimes be difficult to identify the exact spot in your document that causes the validation or parsing error.

To help you understand and resolve validation and parsing errors, DB2 for Linux, UNIX, and Windows has introduced a stored procedure called XSR_GET_PARSING_DIAGNOSTICS. It is included in DB2 9.5 since Fixpack 3 as well as in DB2 9.7. It is not available for DB2 9.1.

If an XML document is not well-formed or invalid for a given XML Schema, invoke the XSR_GET_PARSING_DIAGNOSTICS procedure with the document and optionally the XML Schema as input.
The procedure produces detailed error information, including:

• The line and column number of the error position in the textual XML document
• An XPath that points to the error location in the document, if possible
• The original error message, reason code, and any applicable error tokens

Figure 17.16 shows the syntax and parameters of the XSR_GET_PARSING_DIAGNOSTICS procedure, and Table 17.1 explains the parameters.
    >>-XSR_GET_PARSING_DIAGNOSTICS--(--document--,--relschema--,-->
    >--xmlSchemaName--,--schemaLocation--,--implicitValidation-,-->
    >--errorReport--,--errorCount--)----------------------------><

Figure 17.16 Syntax of the stored procedure XSR_GET_PARSING_DIAGNOSTICS

Table 17.1 Parameters of the Procedure XSR_GET_PARSING_DIAGNOSTICS

    Parameter            Purpose
    -------------------  ------------------------------------------------
    document             The XML document, provided as a BLOB(30M).
                         Cannot be NULL.
    relschema,           Two optional input parameters of type
    xmlSchemaName        VARCHAR(128) that provide the two part SQL
                         identifier of an XML Schema. For example, if the
                         XML Schema is db2admin.custxsd, then relschema
                         should receive the value db2admin and
                         xmlSchemaName the value custxsd.
    schemaLocation       The schema location URI that you specified when
                         you registered the schema. This is an optional
                         alternative way to specify an XML Schema for
                         validation. It can be NULL.
    implicitValidation   Input parameter, must be either 0 or 1, cannot be
                         NULL. 1 means that the document is validated
                         against the schema that is specified by an
                         xsi:schemaLocation attribute within the XML
                         document itself. 0 means that the document is
                         validated against the schema identified by
                         relschema.xmlSchemaName, and if relschema and
                         xmlSchemaName are NULL then the document is not
                         validated.
    errorReport          An output parameter of type VARCHAR(32000) that
                         contains the error information in XML format.
    errorCount           Output parameter for the number of reported
                         errors (INTEGER).

In Figure 17.17 you see a sample invocation of the XSR_GET_PARSING_DIAGNOSTICS procedure in the DB2 Command Line Processor (CLP) as well as the error report produced. The input document is cast to BLOB to match the parameter type of the stored procedure. Since no XML Schema information is provided, this invocation of the procedure only reports well-formedness errors. Note that the closing tag of the name element is misspelled in the document.
The error report reveals that the problem is at character 24 in line 2 of the document, in the element identified by /customerinfo/name.
    CALL xsr_get_parsing_diagnostics(
      BLOB('<customerinfo Cid="1008">
        <name>Kathy Smith</nam>
        <addr country="Canada"><street>5 Rosewood</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M6W 1E6</pcode-zip>
        </addr>
        <phone type="work">416-555-1358</phone>
        </customerinfo>'),'','','',0,?,?);

    Value of output parameters
    --------------------------
    Parameter Name : ERRORDIALOG
    Parameter Value : <ErrorLog>
    <XML_FatalError parser="XML4C">
      <errCode>202</errCode>
      <errDomain>http://apache.org/xml/messages/XMLErrors</errDomain>
      <errText>Expected end of tag 'name'</errText>
      <lineNum>2</lineNum>
      <colNum>24</colNum>
      <location>/customerinfo/name</location>
      <schemaType></schemaType>
      <tokenCount>1</tokenCount>
      <token1>name</token1>
    </XML_FatalError>
    <DB2_Error>
      <sqlstate>2200M<sqlstate/>
      <sqlcode>-16129<sqlcode/>
      <errText>[IBM][CLI Driver][DB2/AIX64] SQL16129N XML document
        expected end of tag "name". SQLSTATE=2200M<errText/>
    </DB2_Error>
    </ErrorLog>

    Parameter Name : ERRORCOUNT
    Parameter Value : 1

Figure 17.17 Obtaining an error report for an XML parsing error

NOTE If you use the procedure XSR_GET_PARSING_DIAGNOSTICS in the DB2 Command Line Processor with a hardcoded XML document as input, as shown in Figure 17.17, make sure that you invoke the CLP with the -q option (db2 -q -t). Without the -q option the input document is always treated as a single long line so that any error is always reported to be in line 1. This is because by default the CLP strips all new-line characters from any submitted input before sending it to the DB2 server. The -q option forces the CLP to retain all whitespace and new-line characters, which ensures that the line and column information in the error report are correct.
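The effect described in the note can be illustrated outside DB2. The following Python sketch is only a model: the helper line_and_column is hypothetical, and removing the new-line characters from the string merely imitates what the CLP is described as doing when the -q option is missing. It shows that the same position in a document maps to line 2 while the line breaks are kept, and to line 1 once they are stripped.

```python
def line_and_column(text, offset):
    """Return the 1-based (line, column) of a 0-based character offset."""
    before = text[:offset]
    line = before.count("\n") + 1
    # rfind returns -1 when there is no earlier newline, so the
    # expression below also works for positions on the first line.
    column = offset - (before.rfind("\n") + 1) + 1
    return line, column


# A shortened version of the document from Figure 17.17 (assumption:
# no leading whitespace on the second line).
doc = '<customerinfo Cid="1008">\n<name>Kathy Smith</nam>\n</customerinfo>'

# Position of the misspelled closing tag with line breaks retained.
with_breaks = line_and_column(doc, doc.index("</nam>"))

# The same document after all new-line characters have been removed.
stripped = doc.replace("\n", "")
without_breaks = line_and_column(stripped, stripped.index("</nam>"))

print(with_breaks, without_breaks)
```

The column values in this sketch refer to the start of the misspelled tag and are not meant to reproduce the exact column that the DB2 parser reports.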
Figure 17.18 illustrates an invocation of the XSR_GET_PARSING_DIAGNOSTICS procedure that checks a document for validation errors against the XML Schema db2admin.custxsd. That’s the same XML Schema as we used in Chapter 16 (see Figure 16.1 in section 16.2, Anatomy of an XML Schema). The relevant parts of that XML Schema are repeated in Figure 17.19.

    CALL xsr_get_parsing_diagnostics(
      BLOB('<customerinfo xmlns="http://pureXMLcookbook.org"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://pureXMLcookbook.org
        customer.xsd"
        Cid="5D8k17">
        <name>Kathy Smith</name>
        <addr country="Canada"><street>5 Rosewood</street>
        <city>Toronto</city>
        <pcode-zip>M6W 1E6</pcode-zip>
        <prov-state>Ontario</prov-state>
        </addr>
        <phone>416-555-1358</phone>
        </customerinfo>'),'db2admin','custxsd','',1,?,?);

Figure 17.18 Obtaining an error report for schema validation

The excerpts of the XML Schema show that the Cid attribute must have a value of type xs:integer, and that the pcode-zip element has to be the last child element of addr. The document in Figure 17.18 does not comply with these rules and schema validation fails.

    (…)
    <xs:complexType name="addrType">
      <xs:sequence>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="prov-state" type="xs:string"/>
        <xs:element name="pcode-zip" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="country" type="xs:string"/>
    </xs:complexType>
    (…)
    <xs:attribute name="Cid" type="xs:integer" />

Figure 17.19 Excerpts of the XML Schema from Figure 16.1

Figure 17.20 shows the error report for the document in Figure 17.18. At the bottom of the error report, note that the error count is 2. The error report shows both validation errors, while regular validation in DB2 would produce an error message only for the first of the two errors (shown in element <DB2_Error>).
The first error explains that the value 5D8k17 in line 5 of the XML document does not match the data type in the XML Schema. The second error reveals that the element pcode-zip in the document violates a so-called content model in the XML Schema, which defines that pcode-zip must be last in the following sequence of elements: ((street,city,prov-state),pcode-zip).

    Value of output parameters
    --------------------------
    Parameter Name : ERRORDIALOG
    Parameter Value : <ErrorLog>
    <XML_Error parser="XML4C">
      <errCode>238</errCode>
      <errDomain>http://apache.org/xml/messages/XML4CErrors</errDomain>
      <errText>Datatype error: Type:InvalidDatatypeValueException,
        Message:Value '5D8k17' does not match regular expression
        facet '[+\-]?[0-9]+'.</errText>
      <lineNum>5</lineNum>
      <colNum>22</colNum>
      <location></location>
      <schemaType>http://www.w3.org/2001/XMLSchema:anyType</schemaType>
      <tokenCount>2</tokenCount>
      <token1>5D8k17</token1>
      <token2>13</token2>
    </XML_Error>
    <XML_Error parser="XML4C">
      <errCode>7</errCode>
      <errDomain>http://apache.org/xml/messages/XMLValidity</errDomain>
      <errText>Element 'pcode-zip' is not valid for content model:
        '((street,city,prov-state),pcode-zip)'</errText>
      <lineNum>11</lineNum>
      <colNum>10</colNum>
      <location>/customerinfo/addr</location>
      <schemaType>http://www.w3.org/2001/XMLSchema:string</schemaType>
      <tokenCount>2</tokenCount>
      <token1>pcode-zip</token1>
      <token2>31</token2>
    </XML_Error>
    <DB2_Error>
      <sqlstate>2200M<sqlstate/>
      <sqlcode>-16210<sqlcode/>
      <errText>[IBM][CLI Driver][DB2/AIX64] SQL16210N XML document
        contained a value "5D8k17" that violates a facet constraint.
        Reason code = "13". SQLSTATE=2200M<errText/>
    </DB2_Error>
    </ErrorLog>

    Parameter Name : ERRORCOUNT
    Parameter Value : 2

Figure 17.20 Error report for a document that fails schema validation
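If an application retrieves the error report, the individual parser errors can be picked out programmatically. The following Python sketch is one possible approach, not part of DB2: the function parse_error_report is hypothetical, and the sample report is an abbreviated stand-in for the output in Figure 17.20. Because the <DB2_Error> element as printed in Figures 17.17 and 17.20 ends its children with tags such as <sqlstate/>, the report as a whole may not parse as well-formed XML. The sketch therefore extracts each <XML_Error> or <XML_FatalError> block with a regular expression and parses only those blocks.

```python
import re
import xml.etree.ElementTree as ET

# One <XML_Error>...</XML_Error> or <XML_FatalError>...</XML_FatalError> block.
_ERROR_BLOCK = re.compile(r"<(XML_(?:Fatal)?Error)\b.*?</\1>", re.DOTALL)


def parse_error_report(report):
    """Return a list of dicts, one per parser error found in the report."""
    errors = []
    for match in _ERROR_BLOCK.finditer(report):
        block = ET.fromstring(match.group(0))
        errors.append({
            "kind": block.tag,
            "line": int(block.findtext("lineNum")),
            "column": int(block.findtext("colNum")),
            "location": (block.findtext("location") or "").strip(),
            "text": (block.findtext("errText") or "").strip(),
        })
    return errors


# Abbreviated sample in the layout of Figure 17.20 (error texts shortened).
sample = """<ErrorLog>
<XML_Error parser="XML4C">
 <errCode>238</errCode>
 <errText>Datatype error</errText>
 <lineNum>5</lineNum>
 <colNum>22</colNum>
 <location></location>
</XML_Error>
<XML_Error parser="XML4C">
 <errCode>7</errCode>
 <errText>Element 'pcode-zip' is not valid for content model</errText>
 <lineNum>11</lineNum>
 <colNum>10</colNum>
 <location>/customerinfo/addr</location>
</XML_Error>
<DB2_Error>
 <sqlstate>2200M<sqlstate/>
</DB2_Error>
</ErrorLog>"""

errors = parse_error_report(sample)
for err in errors:
    print(err["line"], err["column"], err["location"] or "(no location)")
```

The sketch assumes that every error block carries lineNum and colNum elements, as in the two figures. A report without them would need additional checks.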
17.7 VALIDATION DURING LOAD AND IMPORT OPERATIONS

DB2’s LOAD and IMPORT utilities allow you to efficiently move large numbers of XML documents into a table. These utilities are discussed in Chapter 5, Moving XML Data. The LOAD utility in DB2 for z/OS does not yet support XML Schema validation. Hence, this section focuses on the LOAD and IMPORT utilities in DB2 for Linux, UNIX, and Windows.

Both LOAD and IMPORT have options to validate XML documents against XML Schemas. In fact, both utilities offer the same options and the same command syntax for XML Schema handling. In this chapter we focus on the IMPORT utility, but all demonstrated options can be used identically with the LOAD utility. These options include

• Validate all imported or loaded documents against a single XML Schema (section 17.7.1)
• In a single IMPORT or LOAD command, validate different documents against different XML Schemas (section 17.7.2)
• Specify a default XML Schema for IMPORT or LOAD to also validate documents for which no specific XML Schema is explicitly declared in the delimited format input file (section 17.7.3)
• Disable validation for some of the XML Schemas specified in the input file (section 17.7.4)
• Selectively validate documents against a different schema than the one specified in the input file (also in section 17.7.4)
• Validate documents against XML Schemas that are referenced in schema location attributes in the XML documents (section 17.7.5)

17.7.1 Validation against a Single XML Schema

When you import or load a set of XML documents, simply add the clause XMLVALIDATE USING SCHEMA to the LOAD or IMPORT command and specify the SQL identifier of the XML Schema that you want to use for validation. This is illustrated in the IMPORT command in Figure 17.21 and the LOAD command in Figure 17.22. The file load_customer.txt is a delimited format file (DEL), which contains one line for each row that is being processed.
This file contains values for relational columns but no XML data for XML columns. For XML columns, it contains references to separate XML files, which are located in the directory that is specified in the XML FROM clause.

    IMPORT FROM c:\xml\load_customer.txt OF DEL
      XML FROM c:\xml
      XMLVALIDATE USING SCHEMA db2admin.custxsd
      INSERT INTO customer

Figure 17.21 Performing XML Schema validation during IMPORT
    LOAD FROM c:\xml\load_customer.txt OF DEL
      XML FROM c:\xml
      XMLVALIDATE USING SCHEMA db2admin.custxsd
      INSERT INTO customer

Figure 17.22 Performing XML Schema validation during LOAD

The XMLVALIDATE clause and its various options are identical for the LOAD and IMPORT commands. All subsequent examples therefore only show the IMPORT command.

17.7.2 Validation against Multiple XML Schemas

There can be cases when you don’t want to validate all documents in a single LOAD or IMPORT operation against the same XML Schema. In this case you might be able to group your XML documents into multiple sets, one for each XML Schema. Then you can issue several individual LOAD or IMPORT commands to load each set separately, each time specifying the appropriate XML Schema.

DB2 also allows you to use multiple XML Schemas in a single LOAD or IMPORT command. This requires explicit schema references in the delimited format input file. Figure 17.23 shows three lines of a delimited format input file. Each line contains two entries, one for each of the two columns. The first entry is an integer value for the id column of the customer table; the second entry is an XML Data Specifier (XDS) that carries two attributes. The first attribute (FIL) points to an XML file that is to be imported, and the second attribute (SCH) provides the SQL identifier of an XML Schema. As a result, each XML document can be validated against a different XML Schema.

    2000,"<XDS FIL='data2.xml' SCH='DB2ADMIN.CUSTXSD1' />"
    2001,"<XDS FIL='data3.xml' SCH='DB2ADMIN.CUSTXSD2' />"
    2002,"<XDS FIL='data4.xml' SCH='DB2ADMIN.CUSTXSD1' />"

Figure 17.23 Schema identifiers in the delimited format input file

The input file in Figure 17.23 tells DB2 to use the XML Schema CUSTXSD1 to validate the XML documents contained in files data2.xml and data4.xml, and the schema CUSTXSD2 to validate the XML document data3.xml.
Additionally you need to include the XMLVALIDATE USING XDS clause in the IMPORT or LOAD command (see Figure 17.24). Otherwise the SCH attributes in the input are ignored and no validation is performed.

    IMPORT FROM c:\xml\load_customer.txt OF DEL
      XML FROM c:\xml
      XMLVALIDATE USING XDS
      INSERT INTO customer

Figure 17.24 Performing XML Schema validation during IMPORT with multiple schemas
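A delimited format input file with XDS entries, such as the one in Figure 17.23, is typically generated rather than typed. The following Python sketch shows one way to build such lines. The function xds_line and the sample rows are hypothetical. The sketch only covers the FIL and SCH attributes described in this section, and it assumes that file names and schema identifiers contain no characters that would need escaping.

```python
def xds_line(row_id, xml_file, schema=None):
    """Build one line of a delimited format input file with an XDS entry.

    row_id   -- value for the relational id column
    xml_file -- name of the XML file, placed in the FIL attribute
    schema   -- optional SQL identifier of an XML Schema for the SCH attribute
    """
    attrs = "FIL='{}'".format(xml_file)
    if schema is not None:
        attrs += " SCH='{}'".format(schema)
    return '{},"<XDS {} />"'.format(row_id, attrs)


# Sample rows taken from Figure 17.23.
rows = [
    (2000, "data2.xml", "DB2ADMIN.CUSTXSD1"),
    (2001, "data3.xml", "DB2ADMIN.CUSTXSD2"),
    (2002, "data4.xml", "DB2ADMIN.CUSTXSD1"),
]
lines = [xds_line(*row) for row in rows]
print("\n".join(lines))
```

Writing the resulting lines to a file such as load_customer.txt yields input in the format that the IMPORT command in Figure 17.24 reads.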
What happens if the delimited format input file contains schema references (SCH attributes) but you use the XMLVALIDATE USING SCHEMA <schemaID> clause in the LOAD or IMPORT command? In this case the XML Schema specified in the XMLVALIDATE USING SCHEMA clause takes precedence, all documents are validated against that one schema, and the SCH attributes in the input file are ignored.

For a large number of documents you normally don’t create the delimited format input file manually; you may have an application or script that creates it for you. Also, note that DB2’s EXPORT utility can export tables (or subsets of a table defined by a query) to the file system. When you export XML data, the EXPORT utility automatically generates a delimited format file and optionally includes SCH attributes with schema identifiers for all documents that have been validated. Samples of the output produced by EXPORT are shown in Figure 17.23, Figure 17.25, and Figure 17.27.

17.7.3 Using a Default XML Schema

When schema references are included in the delimited format input file, it is possible that not every XDS has a SCH attribute (see Figure 17.25). In this case, the LOAD and IMPORT commands allow you to specify a default schema for those records that do not have a SCH attribute in the input file.

    2000,"<XDS FIL='data2.xml' />"
    2001,"<XDS FIL='data3.xml' SCH='DB2ADMIN.CUSTXSD2' />"
    2002,"<XDS FIL='data4.xml' />"

Figure 17.25 Schema identifiers in the delimited format input file

The IMPORT command in Figure 17.26 contains the DEFAULT option in the XMLVALIDATE USING XDS clause to indicate that any input documents that don’t have a schema reference in the XDS must be validated against the schema custxsd1.
    IMPORT FROM c:\xml\load_customer.txt OF DEL
      XML FROM c:\xmldata
      XMLVALIDATE USING XDS DEFAULT db2admin.custxsd1
      INSERT INTO customer

Figure 17.26 Specifying a default schema for validation

Note that the DEFAULT clause takes precedence over the IGNORE and MAP clauses (discussed in the next sections).

17.7.4 Overriding XML Schema References

Assume you need to import XML data using the delimited format input file in Figure 17.27. This input file contains references to XML Schemas custxsd1, custxsd2, and custxsd3.
    2000,"<XDS FIL='data2.xml' SCH='DB2ADMIN.CUSTXSD1' />"
    2001,"<XDS FIL='data3.xml' SCH='DB2ADMIN.CUSTXSD3' />"
    2001,"<XDS FIL='data3.xml' SCH='DB2ADMIN.CUSTXSD2' />"
    2002,"<XDS FIL='data4.xml' SCH='DB2ADMIN.CUSTXSD1' />"

Figure 17.27 Schema identifiers in the delimited format input file

Let’s say you only want to validate the documents that reference schema custxsd1, but not the documents that reference custxsd2 or custxsd3. One reason could be that you received the input data but you only have schema custxsd1 and not the other two. Another reason could be that the documents for schemas custxsd2 and custxsd3 are already known to be valid and you want to save the CPU cycles of validating them again.

In such cases you can add the IGNORE keyword with a list of schema identifiers to the XMLVALIDATE USING XDS clause. An example is shown in Figure 17.28. It tells DB2 to perform validation based on the schemas specified in the SCH attributes, but not to validate any documents that reference any of the schemas listed in the IGNORE clause.

    IMPORT FROM c:\xml\tab.txt OF DEL
      XML FROM c:\xmldata
      XMLVALIDATE USING XDS
        IGNORE (db2admin.custxsd2, db2admin.custxsd3)
      INSERT INTO customer

Figure 17.28 Disabling validation for selected XML Schemas

Instead of ignoring certain XML Schemas you can also override them with a different schema. The MAP clause allows you to specify alternate XML Schemas to use in place of those specified by the SCH attributes in the delimited format input file. The MAP clause specifies a list of one or more XML Schema pairs, where each pair represents a mapping from one XML Schema to another. The first XML Schema in the pair represents a schema that is referenced by an SCH attribute in an XDS. The second XML Schema in the pair represents the schema that should be used to perform validation.
An example is shown in Figure 17.29, where the IMPORT command uses the schema custxsd1 whenever it sees schema custxsd2 or custxsd3 in an SCH attribute in the input file.

    IMPORT FROM c:\xml\tab.txt OF DEL
      XML FROM c:\xmldata
      XMLVALIDATE USING XDS
        MAP ((db2admin.custxsd2, db2admin.custxsd1),
             (db2admin.custxsd3, db2admin.custxsd1))
      INSERT INTO customer

Figure 17.29 Import with validation against “mapped” XML Schemas
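Before running an import that relies on SCH attributes or on a MAP clause, it can help to confirm that the schemas you intend to validate against are actually registered. The following query is only a minimal sketch of such a check. It assumes that the three schemas from Figure 17.27 were registered under the SQL schema DB2ADMIN, and it uses the catalog view SYSCAT.XSROBJECTS with the columns OBJECTSCHEMA, OBJECTNAME, and OBJECTID that are discussed in section 17.10.

```sql
-- Sketch: list which of the schemas referenced in Figure 17.27 are
-- registered in the XML Schema Repository.
-- Assumption: the schemas live under the SQL schema DB2ADMIN.
SELECT SUBSTR(objectschema, 1, 10) AS xmlschema_schema,
       SUBSTR(objectname, 1, 10)   AS xmlschema_name,
       objectid
FROM syscat.xsrobjects
WHERE objectschema = 'DB2ADMIN'
  AND objectname IN ('CUSTXSD1', 'CUSTXSD2', 'CUSTXSD3');
```

If a schema that appears on the right side of a MAP pair is missing from the result, the IMPORT command in Figure 17.29 cannot be expected to validate against it.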
The following usage rules apply:
• If an XML Schema is present in the left side of a schema pair in the MAP clause, it cannot also be specified in the IGNORE clause.
• If an XML Schema is present in the right side of a schema pair in the MAP clause, it will not be subsequently ignored if listed in the IGNORE clause.
• An XML Schema cannot be mapped more than once. It cannot appear on the left side of more than one schema pair.
• Schema mappings in the MAP clause are non-transitive. For example, assume schema custxsd3 is mapped to schema custxsd2, and assume a second pair maps schema custxsd2 to schema custxsd1; then schema custxsd1 will not be used instead of schema custxsd3.

17.7.5 Validation Based on schemaLocation Attributes

The IMPORT command in Figure 17.30 contains the clause XMLVALIDATE USING SCHEMALOCATION HINTS. This clause indicates that each XML document in the input file is to be validated against the XML Schema that is referenced by the optional xsi:schemaLocation attribute within the document. An xsi:schemaLocation attribute, which is also called a schema location hint, contains a pair of target namespace and schema location. This pair can identify an XML Schema that you have previously registered in the XML Schema Repository. Earlier in this chapter, Figure 17.2 showed an XML document with an xsi:schemaLocation attribute.

    IMPORT FROM c:\xml\load_customer.txt OF DEL
      XML FROM c:\xmldata
      XMLVALIDATE USING SCHEMALOCATION HINTS
      INSERT INTO customer

Figure 17.30 Validation with schema location hints

17.8 CHECKING WHETHER AN EXISTING DOCUMENT HAS BEEN VALIDATED

DB2 allows you to check whether an XML document that is stored in a table has previously been validated. This can be done in a couple of ways. In DB2 for Linux, UNIX, and Windows you can use the IS VALIDATED predicate, which works similarly to the IS NULL predicate that you might already be familiar with.
The query in Figure 17.31 checks every XML document in the info column of the customer table and returns YES if the document has been validated, and NO otherwise.
    SELECT id,
           CASE WHEN info IS VALIDATED THEN 'YES' ELSE 'NO' END AS isvalid
    FROM customer

Figure 17.31 Checking which documents in a table have been validated

The query in Figure 17.32 is very similar but uses a WHERE clause with an XMLEXISTS predicate to check the validation status only of the document(s) where the customer name is Matt Foreman.

    SELECT CASE WHEN info IS VALIDATED THEN 'YES' ELSE 'NO' END AS isvalid
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[name = "Matt Foreman"]')

Figure 17.32 Checking whether a specific document has been validated

To perform similar checks in DB2 for z/OS you need to maintain an additional column in your user table. The column can contain 0 or 1 to indicate whether the document has been validated. Alternatively you can store the OBJECTID of the XML Schema in a BIGINT column. Then you can easily query this column to determine which schema a given XML document belongs to.

17.9 VALIDATING EXISTING DOCUMENTS IN A TABLE

You might encounter a situation where you already have XML documents stored in an XML column and want to validate them against an XML Schema. Maybe they were never validated and you want to validate them now. Or, maybe they had been validated when they were inserted, but now you want to validate them against a new schema. Either way, the validation of existing documents can be achieved with SELECT or UPDATE statements.

Let’s look at the update process first. Figure 17.33 shows an UPDATE statement that replaces a document with a validated copy of itself. The WHERE clause uses a relational predicate to identify a single row in the customer table. In this row, the XML document in the info column is replaced with the result of the XMLVALIDATE function. The XMLVALIDATE function itself also takes the info column as input. If the document is not valid against the specified XML Schema, the update fails.
Otherwise the document is replaced with itself and the OBJECTID of the XML Schema gets attached to the document. This links the document to its schema. The function XMLXSROBJECTID can take the document or any part of it as input, and returns the OBJECTID of the schema that the document was validated against (see section 17.10).
    UPDATE customer
    SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
    WHERE id = 1000

Figure 17.33 Validating an existing document

The UPDATE statement in Figure 17.34 is similar to that in Figure 17.33, but has a different predicate in the WHERE clause. It tries to validate all documents in the XML column that have not been validated before. This update works as expected if all those documents are valid against the specified XML Schema. However, the problem with this UPDATE statement is that it fails and rolls back as soon as the first invalid document is encountered. The reason for this behavior is that the SQL/XML standard requires the XMLVALIDATE function to raise an error if validation fails. You will see later how error handling in a stored procedure can circumvent this problem (see Figure 17.38).

    UPDATE customer
    SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
    WHERE info IS NOT VALIDATED

Figure 17.34 Validating multiple existing documents

Beware that a bulk update with validation of a large number of documents can take a significant amount of time. All affected documents are rewritten in the table space and logged.

If you are only interested in a Yes/No answer whether certain documents are valid for a given schema, and if you don’t require the relationship between documents and schema to be permanently recorded in the database, then a SELECT statement can be used instead of an UPDATE statement. The query in Figure 17.35 reads XML documents from the info column for all customers whose city is Toronto. At the same time it uses the XMLVALIDATE function in the SELECT clause to validate the documents upon retrieval. The query fails at runtime as soon as one document is retrieved that is not valid for the specified schema.
    SELECT XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Toronto"]')

Figure 17.35 Retrieving and validating documents at the same time

If the validation is performed in a stored procedure, an exception handler can catch and handle the validation failure. Figure 17.36 shows a simple stored procedure that takes a single XML document as input and returns 1 if the document is valid and 0 if it is not valid. If the input document is not valid for the specified schema, the exit handler catches the error that is raised by XMLVALIDATE and sets the output parameter isvalid to 0.

    CREATE PROCEDURE validate(IN doc XML, OUT isvalid INTEGER)
    LANGUAGE SQL
    BEGIN
      DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
      DECLARE EXIT HANDLER FOR INVALID_DOCUMENT SET isvalid = 0;
      IF (XMLVALIDATE(doc ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
          IS VALIDATED) THEN
        SET isvalid = 1;
      END IF;
    END #

Figure 17.36 Stored procedure to validate an existing document

The stored procedure in Figure 17.36 can be called from an application or from other stored procedures that manipulate XML documents. You can also call it in the DB2 Command Line Processor, if the first parameter of the stored procedure call is a query that produces a single XML document. This is illustrated in Figure 17.37, where the XML document with id = 1003 from the customer table is passed to the stored procedure for validation. The output shows that the output parameter isvalid has the value 1, which means that the document is valid.

    db2 => call validate((SELECT info FROM customer WHERE id = 1003),?)

      Value of output parameters
      --------------------------
      Parameter Name  : ISVALID
      Parameter Value : 1

      Return Status = 0

    db2 =>

Figure 17.37 Testing the validation stored procedure in the CLP

The stored procedure in Figure 17.38 is designed to perform the same task as the UPDATE statement in Figure 17.34. That is, it validates all documents in the XML column that have not been validated before. The major difference is that this stored procedure does not fail and abort when the first invalid document is encountered. Instead, it loops over the XML documents and uses a CONTINUE handler to count invalid documents instead of raising an error. Alternatively, you could change the CONTINUE handler to write the id values of the invalid documents to a separate table, or take any other appropriate action.
    CREATE PROCEDURE bulkvalidate(OUT num_invalid_docs INTEGER)
    LANGUAGE SQL
    BEGIN
      DECLARE count INTEGER DEFAULT 0;
      DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
      DECLARE CONTINUE HANDLER FOR INVALID_DOCUMENT SET count = count + 1;
      FOR doc AS cur1 CURSOR FOR
        SELECT id, info FROM customer
        WHERE info IS NOT VALIDATED
        FOR UPDATE OF INFO
      DO
        UPDATE customer
        SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
        WHERE CURRENT OF cur1;
      END FOR;
      SET num_invalid_docs = count;
    END #

Figure 17.38 Stored procedure to validate multiple existing documents

17.10 FINDING THE XML SCHEMA FOR A VALIDATED DOCUMENT

DB2 for Linux, UNIX, and Windows also allows you to determine which XML Schema was used to validate a particular XML document. Every XML Schema that is registered in DB2 is assigned an internal identification number of type BIGINT. You can see this number in the column OBJECTID of the catalog view SYSCAT.XSROBJECTS. Whenever an XML document is validated against an XML Schema, the unique identifier (OBJECTID) is stored with the XML document.

The scalar function XMLXSROBJECTID takes an XML document as input and returns the OBJECTID of the XML Schema that was used to validate the XML document. If the input document hasn’t been validated, the value 0 is returned.

There are several interesting uses of the function XMLXSROBJECTID. One is to find the XML Schema that was used to validate a specific document. Another is finding all documents that have been validated against a particular XML Schema.

Figure 17.39 shows how to use the function XMLXSROBJECTID in the WHERE clause of an SQL statement to join with the OBJECTID column in the catalog view syscat.xsrobjects. Together with the predicate on the relational id column, this retrieves information about the schema that was used to validate the document with id 1003.
Instead of the relational predicate you can certainly also use an XMLEXISTS predicate to qualify one or multiple XML documents based on the contents of the XML document itself.
    SELECT c.id,
           SUBSTR(x.objectschema,1,10) AS xmlschema_schema,
           SUBSTR(x.objectname,1,10) AS xmlschema_name
    FROM customer c, syscat.xsrobjects x
    WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
      AND c.id = 1003;

    ID              XMLSCHEMA_SCHEMA XMLSCHEMA_NAME
    --------------- ---------------- --------------
    1003            DB2ADMIN         CUSTXSD

Figure 17.39 Finding schema information for a given XML document

NOTE: There is no hard dependency between a document and the XML Schema it was validated against. This means that an XML Schema can be dropped from the XML Schema Repository even if the database contains documents that were validated against this schema. Those documents continue to carry the OBJECTID of the XML Schema even after the schema is dropped. The OBJECTID now points to a non-existing XML Schema, which has no impact other than the obvious; that is, you won’t find the schema that belongs to these documents.

While the query in Figure 17.39 finds the XML Schema for a given document, the query in Figure 17.40 finds the documents that were validated with a given XML Schema. Again, the function XMLXSROBJECTID facilitates the join between the customer table and the XML Schema Repository. The second and the third predicates select the particular XML Schema db2admin.custxsd for which the query finds all corresponding XML documents.

    SELECT c.id
    FROM customer c, syscat.xsrobjects x
    WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
      AND x.objectschema = 'DB2ADMIN'
      AND x.objectname = 'CUSTXSD'

Figure 17.40 Finding documents for given XML Schema, using XMLXSROBJECTID

Since DB2 9.5 for Linux, UNIX, and Windows you can also use the IS VALIDATED predicate with the ACCORDING TO clause, as shown in Figure 17.41.

    SELECT c.id
    FROM customer c
    WHERE c.info IS VALIDATED ACCORDING TO XMLSCHEMA ID db2admin.custxsd

Figure 17.41 Finding documents for given XML Schema, using IS VALIDATED
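The queries in Figures 17.39 through 17.41 look at one document or one schema at a time. For an overview of a whole column, a grouped variant along the following lines might be useful. It is a sketch rather than a tested recipe: it assumes the customer table used throughout this chapter, and the column aliases schema_objectid and num_docs are made up for illustration. The outer join is chosen so that documents whose schema has been dropped, as well as documents that were never validated, still show up.

```sql
-- Sketch: count the documents in customer.info per validating schema.
-- schema_objectid = 0 means "never validated" (see the description of
-- XMLXSROBJECTID above); a non-zero value with NULL schema names
-- indicates that the schema was dropped from the XSR after validation.
SELECT d.schema_objectid,
       x.objectschema,
       x.objectname,
       COUNT(*) AS num_docs
FROM (SELECT XMLXSROBJECTID(info) AS schema_objectid
      FROM customer) AS d
LEFT OUTER JOIN syscat.xsrobjects x
  ON d.schema_objectid = x.objectid
GROUP BY d.schema_objectid, x.objectschema, x.objectname
ORDER BY num_docs DESC;
```

The nested table expression computes XMLXSROBJECTID once per document, so the grouping and the join operate on a plain BIGINT value.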
If you use multiple XML Schemas to validate documents within a single XML column, and if you frequently need to run queries that relate documents to schemas, consider storing the OBJECTID in an additional column of your table with an index on it. This additional column can greatly improve the performance of finding schemas and documents that relate to each other. In DB2 for z/OS, such an extra column is the only way to correlate documents to schemas.

17.11 HOW TO UNDO DOCUMENT VALIDATION

It is possible to make a validated XML document look and behave as if it had never been validated. When you “undo” the validation, the linkage between the document and any XML Schema is removed, because the OBJECTID of an XML Schema is no longer associated with the document. All it takes is to update the validated document with itself and reparse it without validation. You will probably rarely have to do this, but we want to show that it is possible if needed. It only applies to DB2 for Linux, UNIX, and Windows.

You “remove validation” from a document with an UPDATE statement and the XMLSERIALIZE and XMLPARSE functions as shown in Figure 17.42. This statement serializes the stored document tree back to text format and then parses it again to produce DB2’s internal tree format, but without validation (assuming you don’t have triggers that enforce validation). The document now looks like it has never been validated.

    UPDATE customer
    SET info = XMLPARSE(DOCUMENT XMLSERIALIZE(info AS CLOB(5000)))
    WHERE id = 1000

Figure 17.42 Undoing validation disassociates a document from its schema

Note that the XMLSERIALIZE function requires you to use a character type, such as VARCHAR or CLOB, that is large enough to temporarily hold the serialized document.
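To confirm that the statement in Figure 17.42 had the intended effect, you could inspect the same row with the predicate and function introduced in sections 17.8 and 17.10. The query below is a small sketch that reuses only constructs shown earlier in this chapter; the alias schema_objectid is chosen for illustration.

```sql
-- Sketch: inspect the validation state of the document with id = 1000.
-- After the UPDATE in Figure 17.42, XMLXSROBJECTID should report 0 and
-- the CASE expression should yield 'NO', per sections 17.8 and 17.10.
SELECT id,
       XMLXSROBJECTID(info) AS schema_objectid,
       CASE WHEN info IS VALIDATED THEN 'YES' ELSE 'NO' END AS isvalid
FROM customer
WHERE id = 1000;
```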
17.12 CONSIDERATIONS FOR VALIDATION IN DB2 FOR Z/OS

Throughout this chapter you have seen many ways in which the function XMLVALIDATE can be used in DB2 for Linux, UNIX, and Windows to validate XML documents against an XML Schema. The equivalent function in DB2 9 for z/OS is called SYSFUN.DSN_XMLVALIDATE. The main difference between the two is that DSN_XMLVALIDATE must be an argument to the XMLPARSE function. The other difference is that DSN_XMLVALIDATE does not use an ACCORDING TO XMLSCHEMA clause to identify an XML Schema, but a regular parameter instead. The following sections provide examples.
17.12.1 Document Validation Upon Insert

The DSN_XMLVALIDATE function can take either two or three input parameters. The first parameter is the XML document that you want to validate. It must be of type CLOB or BLOB with a maximum size of 250MB, or of type VARCHAR with a maximum size of 32KB.

If you are using DSN_XMLVALIDATE with two parameters, then the second parameter has to be the SQL identifier of the XML Schema that you want to use for validation. This parameter cannot be NULL. Figure 17.43 shows two INSERT statements that use DSN_XMLVALIDATE with two parameters. The first statement provides the XML document as a parameter marker, and the second uses a host variable. Both specify that the document is to be validated against the XML Schema SYSXSR.CUSTXSD. An error is returned if an XML Schema with this identifier is not found in DB2’s XML Schema Repository (XSR).

    INSERT INTO customer(id, info)
    VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                 CAST(? AS CLOB), 'SYSXSR.CUSTXSD')));

    INSERT INTO customer(id, info)
    VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, 'SYSXSR.CUSTXSD')));

Figure 17.43 Referencing the XML Schema by its SQL identifier

If you are using DSN_XMLVALIDATE with three parameters, then the second and third parameters must be the target namespace and the schema location of the XML Schema that you want to use for validation (see Figure 17.44). This combination of target namespace and schema location must uniquely identify an XML Schema that is registered in the XSR, otherwise an error is raised.

If you use DSN_XMLVALIDATE with three parameters, the second and/or the third parameter can be NULL. In this case DB2 still looks for a corresponding XML Schema in its XML Schema Repository. If both parameters are NULL, DB2 expects to find exactly one schema in the XSR whose target namespace and schema location are NULL.
DB2 for z/OS does not infer the schema from a schema location attribute inside the XML document that you want to validate.

    INSERT INTO customer(id, info)
    VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                 CAST(? AS CLOB), 'http://pureXMLcookbook.org', NULL)));

    INSERT INTO customer(id, info)
    VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, 'http://pureXMLcookbook.org', 'customer.xsd')));

    INSERT INTO customer(id, info)
    VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, NULL, 'customer.xsd')));

    INSERT INTO customer(id, info)
    VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, NULL, NULL)));

Figure 17.44 Referencing the XML Schema by target namespace and schema location

The previous examples provided either the SQL identifier of the XML Schema, or the target namespace and schema location as string literals. Alternatively you can provide them through parameter markers or host variables. The first INSERT statement in Figure 17.45 uses the DSN_XMLVALIDATE function with two parameter markers. The first provides the document to validate and the second provides the SQL identifier of the XML Schema. The second parameter cannot provide an actual XML Schema document for validation, because DB2 only validates against schemas that were previously registered in the XSR. The second INSERT statement in Figure 17.45 uses DSN_XMLVALIDATE with three host variables, which means that the schema is being identified by target namespace and schema location.

    INSERT INTO customer(id, info)
    VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                 CAST(? AS CLOB), ?)));

    INSERT INTO customer(id, info)
    VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, :tgtnamespace_hv, :schemalocation_hv)));

Figure 17.45 Providing schema identification via parameter markers or host variables

The DSN_XMLVALIDATE function can only be used as a parameter to the XMLPARSE function, and in that case the XMLPARSE function cannot use the PRESERVE WHITESPACE clause. Validation always implies that boundary whitespace is stripped, not preserved, in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
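For comparison, the DB2 for Linux, UNIX, and Windows counterparts of the first statement in Figure 17.43 and the second statement in Figure 17.44 could be written roughly as follows. This is a sketch under assumptions: it reuses the LUW schema identifier db2admin.custxsd from earlier in this chapter, takes the namespace and location values from Figure 17.44, and assumes the document arrives as an untyped parameter marker.

```sql
-- LUW sketch: schema referenced by its SQL identifier.
-- XMLVALIDATE is used directly; it is not wrapped in XMLPARSE.
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA ID db2admin.custxsd));

-- LUW sketch: schema referenced by target namespace and schema location.
INSERT INTO customer(id, info)
VALUES (?, XMLVALIDATE(? ACCORDING TO XMLSCHEMA
                         URI 'http://pureXMLcookbook.org'
                         LOCATION 'customer.xsd'));
```

The contrast shows the two differences named at the start of section 17.12: on z/OS the validation function is nested inside XMLPARSE, and the schema is passed as ordinary parameters rather than through an ACCORDING TO XMLSCHEMA clause.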
17.12.2 Document Validation Upon Update

If you use SQL UPDATE statements in DB2 for z/OS to replace existing documents, the DSN_XMLVALIDATE function allows you to validate the new document as part of the update process. In the previous sections you have seen various ways in which you can provide input to the DSN_XMLVALIDATE function. All of them work in UPDATE statements as well, as in Figure 17.46.

    UPDATE customer
    SET info = XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                 :document_hv, 'SYSXSR.CUSTXSD'))
    WHERE id = 1003

Figure 17.46 DSN_XMLVALIDATE in an UPDATE statement

17.12.3 Validating Existing Documents in a Table

There may be situations where you already have XML documents stored in an XML column and want to validate them against an XML Schema. For example, the query in Figure 17.47 selects all documents for customers in Toronto and validates them upon retrieval. Remember that the DSN_XMLVALIDATE function requires the input document to be of type CLOB or BLOB. However, the column info in our customer table is of type XML. Therefore, at the time of writing, the function XMLSERIALIZE is required to convert the XML documents to type CLOB or BLOB.

    SELECT XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             XMLSERIALIZE(info AS CLOB), 'SYSXSR.CUSTXSD'))
    FROM customer
    WHERE XMLEXISTS('$i/customerinfo/addr[city = "Toronto"]'
                    PASSING info AS "i");

Figure 17.47 Validating existing documents in a table

The query in Figure 17.47 parses and validates all matching documents, which requires more CPU cycles than simply retrieving the documents without reparsing them. The query raises an error as soon as one document is encountered that is not valid against the schema SYSXSR.CUSTXSD. You can capture and handle this error in a stored procedure, similar to how it is discussed in section 17.9.

17.12.4 Summary of Platform Similarities and Differences

Table 17.2 provides a summary of the differences in validation functionality between DB2 for z/OS and DB2 for Linux, UNIX, and Windows. This comparison is a point-in-time snapshot and subject to change.
Over time, the features supported in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to converge.
Table 17.2 Summary of Platform Similarities and Differences

Feature | DB2 for Linux, UNIX, and Windows | DB2 for z/OS
Document validation for INSERT and UPDATE operations | Yes | Yes
Validation function | XMLVALIDATE | DSN_XMLVALIDATE; always has to be an argument of the XMLPARSE function.
Can reference XML Schema by its SQL identifier | Yes | Yes
Can reference XML Schema by target namespace and schema location | Yes | Yes
Can validate existing documents in a table | Yes | Yes
Can perform validation in stored procedures | Yes | Yes
Validation support in the LOAD utility | Yes | You can validate documents after LOAD.
Link between documents and schemas is stored with each validated document | Yes* | You can maintain this information in a separate column of the user table.
IS VALIDATED predicate to check whether a document has been validated | Yes* | You can get this information from a separate column in the user table where you record the schema ID for each document.
Function XMLXSROBJECTID to find documents for a given schema, or vice versa | Yes* | (same as the row above)

*If you query the relationship between documents and schemas often, you might want to maintain this information (the schema ID for any given document) in a separate column that is indexed to ensure good performance.

17.13 SUMMARY

Validating XML documents against XML Schemas is the best way to enforce XML data quality in the database. However, document validation is optional in DB2 and there is no performance or functional penalty if you don’t use an XML Schema. If you choose to validate documents, you typically do so when you insert, update, or load them. Existing documents in the database can also be validated in queries. An XML column can contain a mix of validated and non-validated documents, and different documents in a column can be validated with different schemas. In DB2 you are not forced to assign a single XML Schema to an entire XML column.

There are two general approaches for document validation in DB2:
• Application-centric: Applications use the XMLVALIDATE (or DSN_XMLVALIDATE) function in their INSERT and UPDATE statements. This makes validation a distributed responsibility and provides maximum flexibility.
• Database-centric: The database uses triggers and check constraints to enforce validation on a per-XML-column basis.

These application- and database-centric techniques can also be combined to implement a custom validation strategy that meets specific requirements.
Chapter 18 Using XML in Stored Procedures, UDFs, and Triggers

Stored procedures, user-defined functions (UDFs), and triggers are database objects that encapsulate processing steps to retrieve or manipulate data in the database. They can contain multiple statements that are invoked and executed as a single unit. They are typically used to implement application-specific logic. Stored procedures and UDFs can be implemented in the SQL Procedural Language (SQL PL) or in external languages such as Java, C, or COBOL. The benefits of stored procedures and UDFs include:
• Reduced coding labor due to the creation of reusable processing modules
• Richer processing capabilities in the databases by defining custom logic and functions
• Improved performance and reduced network traffic because stored procedures and UDFs are executed close to the data; that is, in the database engine

Stored procedures are executed with CALL statements, which can be issued from an application program, from another stored procedure, from a UDF, or from a trigger. UDFs are used in SQL statements just like you use predefined SQL functions. Triggers are executed automatically when an insert, delete, or update operation happens on a specified table. Triggers are used to implement automated reactions to data modifications and to enforce data integrity rules within the database.

The benefits of stored procedures, UDFs, and triggers apply equally to the processing of XML data and relational data. In this chapter we discuss the following topics:
• Manipulating XML data in stored procedures (section 18.1)
• Manipulating XML data in user-defined functions (section 18.2)
• Manipulating XML data in triggers (section 18.3)
For general background on stored procedures, UDFs, triggers, and the SQL Procedural Language, please consult the resources listed in Appendix C, Further Reading.

18.1 MANIPULATING XML IN SQL STORED PROCEDURES

Stored procedures are a powerful tool for application development. They allow you to define simple or complex multi-statement operations and processing logic that can be invoked with a single call from the application. Stored procedures can encapsulate and hide complex data manipulation from the client application. Since stored procedures are executed in the database server, they can process data without moving it to the client, which is often beneficial for performance. In previous chapters you have already seen several examples where stored procedures implement specific tasks:
• Section 7.7, Figure 7.41: Stored procedure to execute XPath dynamically
• Section 17.3, Figure 17.7: Stored procedure to handle and record validation errors
• Section 17.9, Figure 17.36: Stored procedure to validate an existing document
• Section 17.9, Figure 17.38: Stored procedure to validate multiple existing documents

DB2 for Linux, UNIX, and Windows allows you to use the XML data type not just to define columns in a table, but also to declare input and output parameters as well as variables in stored procedures and user-defined functions. Stored procedures can therefore manipulate XML documents in their parsed format without incurring additional XML parsing, which is a major performance benefit.

Variables of data type XML can be manipulated in stored procedures much like variables of other types. For example, XML variables can receive their value through statements such as a SET statement or a SELECT INTO statement. The only restriction is that XML variables and XML input parameters lose their value upon a COMMIT or ROLLBACK operation.
If you want to use an XML variable or parameter after a ROLLBACK or COMMIT statement, you need to assign new values to them first. Otherwise error SQL1354N is raised.

The best way to use XPath or XQuery expressions in stored procedures is to embed them in the SQL/XML functions XMLQUERY, XMLTABLE, or XMLEXISTS. These can be used in stored procedure statements and accept variables of type XML in their PASSING clause. You can also use XQuery without SQL in stored procedures, but only with dynamic cursors. Static XQuery is not allowed.

18.1.1 Basic XML Manipulation in Stored Procedures

Let’s look at Figure 18.1 to become familiar with the basic capabilities of handling XML data in stored procedures. The table addrtable is defined in addition to the customer table that we have been using. The stored procedure has one input parameter and one output parameter, both of type XML. Additionally, the procedure declares the variables id and address of type INTEGER and XML, respectively.

The first SET statement extracts the Cid attribute from the input document, converts it to INTEGER, and assigns it to the variable id. Note that the input parameter custDoc is passed into the XMLQUERY function.

Next is the SELECT INTO statement, which demonstrates two important capabilities. First, the INTO clause is used to assign an XML value to the XML output parameter oldDoc. Second, the variable id is passed into the XMLEXISTS predicate so that only the matching document is retrieved from the customer table.

The last part of the stored procedure shows that you can use the XMLEXISTS predicate directly in an IF statement. It checks whether the address in the input document is in Canada. If this is true then the SET statement extracts the addr element of the document and assigns it to the XML variable address. Subsequently the address and the id variables are inserted into the table addrtable.

    CREATE TABLE addrtable(id INTEGER, addr XML)#

    CREATE PROCEDURE processDoc(IN custDoc XML, OUT oldDoc XML)
    BEGIN ATOMIC
      DECLARE id INTEGER;
      DECLARE address XML;

      SET id = XMLCAST(XMLQUERY('$d/customerinfo/@Cid'
                                PASSING custDoc AS "d") AS INTEGER);

      SELECT info INTO oldDoc
      FROM customer
      WHERE XMLEXISTS('$INFO/customerinfo[@Cid = $x]' PASSING id AS "x");

      IF XMLEXISTS('$d/customerinfo/addr[@country = "Canada"]'
                   PASSING custDoc AS "d") THEN
        SET address = XMLQUERY('$d/customerinfo/addr' PASSING custDoc AS "d");
        INSERT INTO addrtable(id, addr) VALUES(id, XMLDOCUMENT(address));
      END IF;
    END #

Figure 18.1 Stored procedure with basic XML manipulation

Since the body of a stored procedure can contain multiple statements, these statements have to be separated by the semicolon character.
This use of the semicolon conflicts with the fact that the semicolon is also the default terminating character for statements in the DB2 Command Line Processor (CLP). The same applies to user-defined functions and triggers. To avoid problems you need to use a different terminating character in the CLP. For example, in Figure 18.1 the # is used as the terminating character for the CREATE PROCEDURE statement. You must invoke the CLP with the -td# option to set the #, or any other character of your choosing, as the statement terminator. If the CREATE PROCEDURE statement in Figure 18.1 is in a file create_proc.sql then the following command issued at the OS prompt creates the procedure:

    db2 -td# -f create_proc.sql

18.1.2 A Stored Procedure to Store XML in a Hybrid Manner

Let’s look at a common use case for a stored procedure. Assume you want to store the customer sample documents in a hybrid fashion. You might decide to keep the address information as XML, because you expect it to be of variable format over time, but you want to store customer name and phone information in relational columns. Since each customer can have multiple phone numbers (one-to-many relationship), the phone numbers have to be stored in a separate table with a proper join key. That join key can be a number generated by a sequence for each new XML document that comes in. A sequence is a database object that produces a stream of unique values. Figure 18.2 shows the definition of the target tables and the sequence.

    CREATE TABLE cust (id INTEGER, name VARCHAR(20), addr XML);
    CREATE TABLE phone(id INTEGER, type VARCHAR(20), number VARCHAR(20));
    CREATE SEQUENCE id_seq START WITH 1 INCREMENT BY 1 CACHE 100;

Figure 18.2 Table and sequence definition for hybrid storage

The stored procedure in Figure 18.3 takes a customer XML document as an input parameter. Note that this parameter is of type XML. Each time the procedure is called, it uses the NEXTVAL expression to pull a new id value from the sequence. Then it uses two INSERT statements with XMLTABLE functions to extract the required values for insert into the target tables cust and phone. The first insert produces one row per customer, the second produces one row per phone element. The same id value is used for inserts into both tables to ensure referential integrity.
Instead of using the sequence, the id could also be passed as a parameter from the calling application, or extracted from the document.

    CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER)
    BEGIN ATOMIC
      SET id = NEXTVAL FOR id_seq;

      INSERT INTO cust(id, name, addr)
        SELECT id, T.name, T.address
        FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
               COLUMNS
                 name    VARCHAR(20) PATH 'name',
                 address XML         PATH 'document{addr}' ) as T;

      INSERT INTO phone (id, type, number)
        SELECT id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 type VARCHAR(20) PATH '@type',
                 num  VARCHAR(20) PATH '.') AS T;
    END #

Figure 18.3 Stored procedure for hybrid XML inserts

With the stored procedure in Figure 18.3 in place, an application should use the stored procedure call CALL insertCustomer(?, ?) to insert new customer documents and never use direct INSERT statements. The two parameter markers correspond to the input document and the output parameter id. If all inserts are performed through this stored procedure, the relational and XML data in the tables are always consistent. You can have similar stored procedures for update and delete operations. The stored procedures can also contain additional business logic or data manipulation.

A challenging situation occurs when the stored procedure in Figure 18.3 fails with the following error message, where <value> is a data value in the input document that cannot be cast to the data type VARCHAR(20):

    SQL16061N The value <value> cannot be constructed as, or cast
    (using an implicit or explicit cast) to the data type "VARCHAR_20".
    Error QName=err:FORG0001. SQLSTATE=10608.

Note that the XMLTABLE functions in the stored procedure cast the customer name, phone type, and phone number to VARCHAR(20). However, the error message does not specify which one of them caused the problem. In this simple example, a quick look at the <value> might reveal which XML element or attribute caused the error. However, in more complex cases it is often difficult to identify which element or attribute is responsible for the error.

The solution is to add code to the stored procedure to catch the SQL error, obtain the offending <value>, look for it in the input document, and return the name of the XML element or attribute that caused the problem. This logic is coded in Figure 18.4. The INSERT statements in the procedure in Figure 18.4 are the same as in Figure 18.3.
The difference in Figure 18.4 is the error handling. The procedure declares SQLSTATE 10608 as a condition, and an exit handler to take appropriate action when this condition occurs. This action is enclosed in a separate BEGIN-END block and is only executed when the declared error happens. The exit handler obtains the error information and uses the SUBSTR function to extract the offending <value> and data type from it. Then it uses the XQuery expression $d//(*,@*)[data(.) = $v]/local-name() to obtain the name of the element or attribute that contains the offending value. In this expression, $d represents the XML document and $v the value to look for. The first part of the expression, $d//(*,@*), iterates over all elements and attributes in the document. For each of those, the predicate [data(.) = $v] checks whether the value of the element or attribute matches the <value> from the error message. If the predicate is true, then the last step of the expression, /local-name(), obtains the name of the element or attribute. The whole expression is an argument of the function string-join, which produces a comma-separated list in case more than one node with the matching value is found in the document.

    CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER,
                                    OUT MESSAGE_TEXT VARCHAR(300))
    BEGIN ATOMIC
      DECLARE vErrMsg      VARCHAR(300);
      DECLARE vValue       VARCHAR(100);
      DECLARE vNode        VARCHAR(100);
      DECLARE vType        VARCHAR(100);
      DECLARE vTokenString VARCHAR(100);
      DECLARE XMLTABLE_CAST_FAILURE CONDITION FOR SQLSTATE '10608';
      DECLARE EXIT HANDLER FOR XMLTABLE_CAST_FAILURE
      BEGIN
        -- retrieve error message and token string
        GET DIAGNOSTICS EXCEPTION 1
            vTokenString = DB2_TOKEN_STRING,
            vErrMsg = MESSAGE_TEXT;
        SET vValue = SUBSTR(vErrMsg, 23, POSSTR(vErrMsg, '" ')-23);
        SET vType  = SUBSTR(vTokenString, LENGTH(vValue)+2);
        -- find xml nodes whose values match the error token
        SET vNode = XMLCAST(XMLQUERY('
              string-join($d//(*,@*)[data(.) = $v]/local-name(),",")'
              PASSING custDoc AS "d", vValue AS "v") AS VARCHAR(100));
        -- create message text
        SET MESSAGE_TEXT = 'Failed to cast the value "' || vValue ||
                           '", at element or attribute "' || vNode ||
                           '", to type "' || vType || '".';
      END ;

      SET id = NEXTVAL FOR id_seq;

      INSERT INTO cust(id, name, addr)
        SELECT id, T.name, T.address
        FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
               COLUMNS
                 name    VARCHAR(20) PATH 'name',
                 address XML         PATH 'document{addr}' ) as T;

      INSERT INTO phone (id, type, number)
        SELECT id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 type VARCHAR(20) PATH '@type',
                 num  VARCHAR(20) PATH '.') AS T;

      SET MESSAGE_TEXT = 'Insert successful.';
    END #

Figure 18.4 Stored procedure for hybrid XML inserts with error handling

18.1.3 Loops and Cursors

The example in Figure 18.5 shows that you can easily loop over the elements and attributes from one or multiple XML documents. The stored procedure takes an XML document as input and uses a SELECT statement with an XMLTABLE function to produce one row for each phone element. The FOR statement is used to iterate over these rows. When a FOR statement is executed, a cursor is implicitly declared such that each iteration of the FOR loop fetches the next row from the result set until there are no rows left. For each row, the statements in the DO clause of the FOR statement are executed. An IF-THEN-ELSE statement inserts the phone information into the table cellphones if the phone type is cell, and into the table landlines otherwise. To keep stored procedures simple, we recommend the use of FOR statements instead of explicit cursor declarations whenever possible.

    CREATE TABLE cellphones(id INTEGER, number VARCHAR(20))#
    CREATE TABLE landlines(id INTEGER, number VARCHAR(20))#

    CREATE PROCEDURE processPhones(IN custDoc XML)
    BEGIN ATOMIC
      FOR phone AS
        SELECT T.id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 id   INTEGER     PATH '../@Cid',
                 type VARCHAR(5)  PATH '@type',
                 num  VARCHAR(20) PATH '.') as T
      DO
        IF phone.type='cell' THEN
          INSERT INTO cellphones(id, number) VALUES(phone.id, phone.num);
        ELSE
          INSERT INTO landlines(id, number) VALUES(phone.id, phone.num);
        END IF;
      END FOR;
    END #

Figure 18.5 FOR loop over repeating XML elements
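To see the FOR loop of Figure 18.5 at work, the procedure can be called with a small test document. The following is only a sketch under stated assumptions: the table phonetest and the document content are made up for illustration, the CLP is assumed to have been started with the -td# option, and the document is passed to the procedure through a fullselect that returns exactly one row with one XML column.

```sql
-- Hypothetical staging table for test documents (not part of Figure 18.5).
CREATE TABLE phonetest(id INTEGER NOT NULL PRIMARY KEY, doc XML)#

-- A made-up customer document with one work phone and one cell phone.
INSERT INTO phonetest(id, doc) VALUES (1, XMLPARSE(DOCUMENT
  '<customerinfo Cid="1004">
     <name>Matt Foreman</name>
     <phone type="work">905-555-4789</phone>
     <phone type="cell">905-555-0100</phone>
   </customerinfo>'))#

-- Pass the stored document to the procedure from Figure 18.5.
CALL processPhones((SELECT doc FROM phonetest WHERE id = 1))#

-- Under these assumptions the cell phone should appear in cellphones
-- and the work phone in landlines, both with the id taken from @Cid.
SELECT id, number FROM cellphones#
SELECT id, number FROM landlines#
```

Because the procedure branches only on the value cell, any other phone type (work, home, and so on) ends up in the landlines table.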
You can also use XQuery without SQL in stored procedures, but not in a FOR statement or in any other static manner. You have to construct the XQuery dynamically as a string and prepare and open it as a dynamic cursor. In Figure 18.6 an XQuery string is assigned to the variable xqr. Note that the query string includes the value of the input parameter city. The query is then prepared and opened as a CURSOR WITH RETURN TO CALLER. With this cursor definition, the result sequence of the XQuery becomes the result set of the stored procedure. The procedure does not fetch from or close the cursor, which allows the calling application to iterate over the result of the query. Alternatively, you could decide to have a WHILE loop with a FETCH statement in the stored procedure itself to process the result set.

    CREATE PROCEDURE cityphones(IN city VARCHAR(20))
    BEGIN ATOMIC
      DECLARE xqr VARCHAR(2048);
      DECLARE c1 CURSOR WITH RETURN TO CALLER FOR stmt;
      SET xqr = 'xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
                 where $i/customerinfo/addr[city="'|| city ||'"]
                 return $i/customerinfo/phone';
      PREPARE stmt FROM xqr;
      OPEN c1;
    END #

Figure 18.6 Dynamic cursor for an XQuery

18.1.4 A Stored Procedure to Update a Selected XML Element or Attribute

The stored procedure in Figure 18.7 changes the value of a selected XML node in a document. The input parameters to the procedure are an XML document, the path to the node that is to be updated, and the new value of the node. The parameter for the XML document is declared as INOUT, so that the updated document is returned. The procedure constructs an XQuery update expression in an XMLQUERY function. The input parameter xpath provides the target path for the replace clause. Additionally, the document and the new value are passed as parameters into the XQuery Update expression.
The statement OPEN c1 USING mydoc, value binds the procedure parameters mydoc and value to the parameter markers in the XMLQUERY function.

    CREATE PROCEDURE updateXPath (INOUT mydoc XML,
                                  IN xpath VARCHAR(1024),
                                  IN value VARCHAR(128))
    BEGIN ATOMIC
      DECLARE sql VARCHAR(2048);
      DECLARE c1 CURSOR FOR stmt;
      SET sql = 'VALUES XMLQUERY(''
                   copy $new := $original
                   modify do replace value of $new' || xpath ||'
                          with $value
                   return $new ''
                 PASSING XMLCAST(? AS XML) AS "original",
                         CAST(? AS VARCHAR(1024)) AS "value") ';
      PREPARE stmt FROM sql;
      OPEN c1 USING mydoc, value;
      FETCH c1 INTO mydoc;
      CLOSE c1;
    END #

Figure 18.7 Stored procedure to update a selected XML element or attribute

18.1.5 Three Tips for Testing Stored Procedures

The following three tips are not as widely known as they should be, but they are extremely useful.

Tip 1: How to Test Stored Procedures in the CLP

It is often very useful to test stored procedures in the CLP without needing application code that calls the procedure and passes an XML document as input. You can simply import your test documents into a DB2 table, such as testdocs, and use an SQL fullselect as the input parameter in the stored procedure call in the CLP. Make sure that the fullselect produces exactly one row with one column of type XML, as shown in Figure 18.8. The second parameter is a question mark as a placeholder for the output parameter oldDoc.

    CREATE TABLE testdocs(id INTEGER NOT NULL PRIMARY KEY, doc XML);
    IMPORT FROM testdata.del OF DEL INSERT INTO testdocs;
    CALL processDoc( (SELECT doc FROM testdocs WHERE id = 3),? );

Figure 18.8 Testing a stored procedure

Tip 2: How to Get the Execution Plan of a Stored Procedure

If a stored procedure does not perform well, then it can be useful to examine the execution plans of queries or other statements in the stored procedure. One approach is to copy individual statements from the stored procedure and to explain them separately. However, a statement can have a different execution plan when it is compiled in the context of a stored procedure than when it is compiled by itself. In DB2 for Linux, UNIX, and Windows you can use the following approach to explain the statements within a stored procedure.

1. Establish a connection to the database.

2. Create explain tables if they do not already exist (see section 14.1.1, The Explain Tables in DB2 for Linux, UNIX, and Windows).

3. Issue the following command at the OS prompt to enable the capturing of execution plans when stored procedures are created in the current session:

    db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN ALL')"

4. If a CREATE PROCEDURE statement is the only statement in a file called create_proc.sql, and if the statement is terminated with the # character, create the procedure with the following command at the OS prompt:

    db2 -td# -f create_proc.sql

5. Use the db2exfmt utility to write the execution plan to a file such as myprocplan.txt:

    db2exfmt -d <dbname> -1 -o myprocplan.txt

The output file will contain separate explain information for each statement in the stored procedure. If you want to check whether the capturing of explain information for stored procedures is enabled, use the following SELECT statement:

    SELECT GET_ROUTINE_OPTS() FROM sysibm.sysdummy1

To revert to not explaining stored procedures, use this command:

    db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN NO')"

Tip 3: How to Profile a Stored Procedure

IBM Data Studio Developer contains a very useful stored procedure profiler that can provide information about the runtime performance of a procedure. For each statement in the stored procedure, the profile reveals the number of executions, the elapsed time, CPU time, and other optional metrics such as the number of rows read or written, or the number of logical and physical page reads. This information is extremely helpful to understand the behavior of a complex stored procedure and to discover which parts of a procedure are particularly expensive to run. If you have a Data Development Project in Data Studio and a stored procedure in the Stored Procedures folder of the Data Project Explorer, right-click on the procedure name and choose Run Profiling.
The same context menu also has a command to invoke the stored procedure debugger, which is another helpful tool for the development of stored procedures in DB2 for Linux, UNIX, and Windows, and DB2 for z/OS.

18.2 MANIPULATING XML IN USER-DEFINED FUNCTIONS

DB2 9.7 for Linux, UNIX, and Windows allows you to use the XML data type in user-defined functions (UDFs). UDFs can have XML type parameters and variables and can contain SQL/XML statements that manipulate XML data. Most of these capabilities are similar to the XML support in stored procedures. An important difference between UDFs and stored procedures is that UDFs can be used in SQL statements while stored procedures can only be invoked with a CALL statement. In this section we discuss several examples of UDFs that manipulate XML data.
18.2.1 A UDF to Extract an Element or Attribute Value

The function getname in Figure 18.9 takes an XML document as input and returns a value of type VARCHAR(25). The body of the function consists of a single RETURN statement. It contains the functions XMLCAST and XMLQUERY to extract the name element and convert it to VARCHAR(25). The PASSING clause of the XMLQUERY function passes the function’s input parameter doc into the XPath expression. Below the function you see an SQL statement that invokes the function in its SELECT clause. The use of the UDF allows an application to retrieve customer names without having to code the actual XPath expression and SQL/XML functions.

    CREATE FUNCTION getname(doc XML)
    RETURNS VARCHAR(25)
    LANGUAGE SQL CONTAINS SQL
    NO EXTERNAL ACTION DETERMINISTIC
    BEGIN ATOMIC
      RETURN XMLCAST(XMLQUERY('$d/customerinfo/name'
                     PASSING doc AS "d") AS VARCHAR(25));
    END #

    SELECT getname(info) AS name
    FROM customer
    WHERE cid = 1002 #

    NAME
    -------------------------
    Jim Noodle

      1 record(s) selected.

Figure 18.9 Scalar UDF to extract an element value

Such a scalar UDF also enables you to create a table with a generated column whose value is automatically computed based on the XML documents in an XML column:

    CREATE TABLE custinfo(info XML,
                          name VARCHAR(25) GENERATED ALWAYS AS (getname(info)));

The function in Figure 18.9 is a scalar function, which means it returns a single value. If you want to use a similar function to extract a repeating element, then a table function instead of a scalar function can be more appropriate. This is shown next.

18.2.2 A UDF to Extract the Values of a Repeating Element

Figure 18.10 demonstrates a function that extracts the phone elements from a given document. Since a customer document can have multiple phone elements, the return type of the UDF is a table. This UDF is therefore a table function.
The structure of the returned table is defined in the second line of the CREATE FUNCTION statement. The body of the function contains a RETURN statement that includes an SQL/XML query that produces the rows and columns of the result table.
Below the function you see an SQL query that uses the UDF. Since this UDF is a table function, it is used in a table expression in the FROM clause of the SELECT statement. The result set of the query includes two columns from the UDF plus the cid column from the customer table.

    CREATE FUNCTION getphone(doc XML)
    RETURNS TABLE(type VARCHAR(10), number VARCHAR(20))
    BEGIN ATOMIC
      RETURN
        SELECT type, number
        FROM XMLTABLE('$d/customerinfo/phone' PASSING doc AS "d"
               COLUMNS
                 type   VARCHAR(10) PATH '@type',
                 number VARCHAR(20) PATH '.') ;
    END #

    SELECT cid, p.type, p.number
    FROM customer, TABLE(getphone(info)) p
    WHERE cid = 1004#

    CID              TYPE       NUMBER
    ---------------- ---------- --------------------
    1004             work       905-555-4789
    1004             home       416-555-3376

      2 record(s) selected.

Figure 18.10 Table UDF to extract repeating element values

You can certainly use multiple UDFs in a single query, as illustrated by the query in Figure 18.11.

    SELECT getname(info) AS name, p.type, p.number
    FROM customer, TABLE(getphone(info)) p
    WHERE cid IN (1004, 1005)

    NAME                      TYPE       NUMBER
    ------------------------- ---------- --------------------
    Matt Foreman              work       905-555-4789
    Matt Foreman              home       416-555-3376
    Larry Menard              work       905-555-9146
    Larry Menard              home       416-555-6121

      4 record(s) selected.

Figure 18.11 Using a scalar UDF and a table UDF in a query

18.2.3 A UDF to Shred XML Data to a Relational Table

A table function can also help you shred XML data into a relational table. Suppose you want to populate the following target table:

    CREATE TABLE address(cid INTEGER, name VARCHAR(30),
                         street VARCHAR(40), city VARCHAR(30))

To shred XML documents into this table, you can create a table function that takes an XML document as input and returns a set of rows with columns that match the target table. Figure 18.12 defines such a function.

    CREATE FUNCTION extractcols(doc XML)
    RETURNS TABLE(cid INT, name VARCHAR(30),
                  street VARCHAR(40), city VARCHAR(30))
    BEGIN ATOMIC
      RETURN
        SELECT x.custid, x.custname, x.str, x.city
        FROM XMLTABLE('$d/customerinfo' PASSING doc AS "d"
               COLUMNS
                 custid   INTEGER     PATH '@Cid',
                 custname VARCHAR(30) PATH 'name',
                 str      VARCHAR(40) PATH 'addr/street',
                 city     VARCHAR(30) PATH 'addr/city' ) AS x ;
    END #

Figure 18.12 Table function to extract several elements and attributes

You can then include this table function in an INSERT-INTO-SELECT-FROM statement. The first INSERT statement in Figure 18.13 reads XML documents from the XML column info of the customer table and shreds them into the address table. The function extractcols takes the XML column info as input and produces relational rows for insert into the target table. The second INSERT statement in Figure 18.13 shreds an XML document that is provided by an application through the parameter marker in the FROM clause.

    INSERT INTO address(cid, name, street, city)
      SELECT e.cid, e.name, e.street, e.city
      FROM customer c, TABLE(extractcols(c.info)) e
      WHERE c.cid < 1050;

    INSERT INTO address(cid, name, street, city)
      SELECT e.cid, e.name, e.street, e.city
      FROM TABLE(extractcols(cast(? as XML))) e ;

Figure 18.13 Using a table function to shred XML documents

18.2.4 A UDF to Modify an XML Document

Chapter 12, Updating and Transforming XML Documents, describes XQuery Update expressions that allow you to change the value of an element or attribute, or to insert, rename, or delete elements and attributes in a document.
It can be convenient to encapsulate such update expressions in a user-defined function, which then serves as a much simpler update interface for database applications.
Using the customer documents in the sample database as an example, suppose you want to simplify the task of updating a selected phone element in a document. You could code the UDF in Figure 18.14, which has the following input parameters:

• doc: the XML document that is to be updated
• phonetype: a string such as “cell” or “work” to indicate which phone is to be updated
• number: the new telephone number

The function returns the input document where the phone element with the matching type attribute has been given the new value.

    CREATE FUNCTION updatephone(doc XML, phonetype VARCHAR(8),
                                number VARCHAR(12) )
    RETURNS XML
    BEGIN ATOMIC
      RETURN XMLQUERY('
        copy $new := $p1
        modify do replace value of $new/customerinfo/phone[@type=$p2]
                  with $p3
        return $new'
        PASSING doc AS "p1", phonetype as "p2", number as "p3");
    END #

Figure 18.14 Scalar UDF to modify an XML document

If an application wants to change the work phone number of customer 1002 to the new value 408-463-4963, it can simply issue the UPDATE statement in Figure 18.15 and does not need to be concerned with the details of the underlying XQuery Update expression.

    UPDATE customer
    SET info = updatephone(info, 'work', '408-463-4963')
    WHERE cid = 1002

Figure 18.15 UPDATE statement with a scalar UDF

Remember that the update expression “replace value of” fails if the target path ($new/customerinfo/phone[@type=$p2]) does not produce exactly one node. In other words, the invocation of the UDF in Figure 18.15 leads to an error if the document for customer 1002 does not contain a phone element whose type attribute has the value work. Therefore you might want to perform an “upsert” operation (update or insert). An “upsert” operation updates the phone element if it exists and inserts a new phone element otherwise. This logic is coded in the UDF in Figure 18.16 with an XQuery if-then-else expression.
The else branch constructs a new phone element with a type attribute, and the variables $p2 and $p3 provide the values for this attribute and element, respectively. Within such attribute and element constructors the variables $p2 and $p3 have to be in curly brackets.

    CREATE FUNCTION upsert_phone(doc XML, phonetype VARCHAR(8),
                                 number VARCHAR(12) )
    RETURNS XML
    BEGIN ATOMIC
      RETURN XMLQUERY('copy $new := $p1
        modify
          if ($new/customerinfo/phone[@type = $p2])
          then do replace value of
                  $new/customerinfo/phone[@type = $p2] with $p3
          else do insert <phone type="{$p2}">{$p3}</phone>
                  as last into $new/customerinfo
        return $new'
        PASSING doc AS "p1", phonetype as "p2", number as "p3");
    END #

Figure 18.16 Scalar UDF to update or insert an XML element (“upsert”)

18.3 MANIPULATING XML DATA WITH TRIGGERS

A trigger defines a set of operations that are performed in response to an INSERT, UPDATE, or DELETE statement on a specified table. For example, a trigger can perform updates to other tables, automatically generate or change values for inserted or updated rows, or invoke functions and stored procedures.

When an INSERT, UPDATE, or DELETE statement activates a trigger, the operations that are executed by the trigger can reference the column values of the rows that are being inserted, updated, or deleted. So-called transition variables allow you to reference the new column values provided in INSERT and UPDATE statements, or the old values that are removed by DELETE or UPDATE statements.

You can define triggers on tables with XML columns, and you can also define UPDATE triggers on individual XML columns in a table. Transition variables in triggers do not allow you to access the old or new value of an XML column. This is true in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. However, the transition variables do allow you to reference the old or new value of non-XML columns in the same row, such as primary key values. Therefore, triggers can still be used for effective XML manipulation, as you will see in the examples in this section.
DB2 for Linux, UNIX, and Windows has one exception where it is possible to reference the new value of an XML column as a transition variable. The exception is that the new value of an XML column can be used in the XMLVALIDATE function to trigger the validation of a document that is being inserted or updated. Such a validation trigger was shown in section 17.5, Automatic Validation with Triggers.
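As a rough reminder of what such a validation trigger can look like, here is a minimal sketch. It is not a copy of the trigger in section 17.5: the trigger name validate_cust and the XML schema identifier db2admin.custxsd are hypothetical, and the sketch assumes that a suitable schema has already been registered in the XML schema repository under that name.

```sql
-- Sketch only: validate each new document before it is inserted.
-- The names validate_cust and db2admin.custxsd are assumptions.
CREATE TRIGGER validate_cust
  NO CASCADE BEFORE INSERT ON customer
  REFERENCING NEW AS newrow
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  -- The new XML value is referenced only as the argument of XMLVALIDATE.
  SET newrow.info = XMLVALIDATE(newrow.info
                      ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
END#
```

Note that this is a BEFORE trigger, whereas the examples in the remainder of this section are AFTER triggers that read the stored document back from the table.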
18.3.1 Insert Triggers on Tables with XML Columns

Let’s look at an example in which triggers maintain the hybrid storage of incoming XML data. Suppose you receive XML documents such as the customer documents in the sample database. For reasons explained in section 2.4, Using a Hybrid XML/Relational Approach, you might decide to store the full document in a column of type XML and to extract a few selected element values into relational columns. For example, you might want to use relational columns to store the customer name and city as well as the type and number of the customer phones. Figure 18.17 defines the appropriate target tables. Since a customer document can contain multiple phone elements, the phone information is stored in a separate table together with a join key.

    CREATE TABLE cust(cust_id INTEGER NOT NULL PRIMARY KEY
                              GENERATED ALWAYS AS IDENTITY,
                      name    VARCHAR(30),
                      city    VARCHAR(25),
                      info    XML )#

    CREATE TABLE phones(cust_id INTEGER NOT NULL,
                        type    VARCHAR (5),
                        number  VARCHAR (15) )#

Figure 18.17 Tables for hybrid XML storage

Next you can define a trigger that automatically populates the relational columns in both tables whenever an XML document is inserted into the info column with an INSERT statement, such as the following:

    INSERT INTO cust(info) VALUES(?)

An appropriate insert trigger is shown in Figure 18.18. The trigger is fired after a new row is inserted into the cust table but before the INSERT statement commits. The transition variable newrow can be used to reference the column values of the newly inserted row, except for the XML column. For example, newrow.cust_id identifies the generated primary key value of the inserted row. This primary key value allows subselects in the trigger to identify the newly inserted row in the table and to extract the desired element values from the new XML document in that row.
Since the XML document cannot be accessed through the transition variable, the trigger accesses the document directly in the table based on the primary key that it finds in the transition variable. The body of the trigger contains an UPDATE statement and an INSERT statement. The UPDATE statement populates the columns name and city in the newly inserted row. The INSERT statement adds rows to the phones table, one row for each phone element in the new document. These rows include the primary key cust_id of the cust table so that the relationship between phones and customers is properly maintained.
    CREATE TRIGGER cust_insert
    AFTER INSERT ON cust
    REFERENCING NEW AS newrow
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      UPDATE cust
      SET (name, city) =
          (SELECT X.name, X.city
           FROM cust, XMLTABLE('$INFO/customerinfo'
                  COLUMNS
                    name VARCHAR(30) PATH 'name',
                    city VARCHAR(20) PATH 'addr/city') AS X
           WHERE cust.cust_id = newrow.cust_id )
      WHERE cust.cust_id = newrow.cust_id;

      INSERT INTO phones(cust_id, type, number)
        SELECT cust.cust_id, P.type, P.number
        FROM cust, XMLTABLE('$INFO/customerinfo/phone'
               COLUMNS
                 type   VARCHAR(5)  PATH '@type',
                 number VARCHAR(15) PATH '.') AS P
        WHERE cust.cust_id = newrow.cust_id;
    END#

Figure 18.18 Insert trigger

18.3.2 Delete Triggers on Tables with XML Columns

Let’s continue with the preceding example. In addition to the insert trigger you also need a delete trigger that removes the correct rows from the phones table whenever rows are deleted from the cust table. Figure 18.19 shows such a delete trigger. The transition variable oldrow provides access to the cust_id values of the rows deleted in the cust table. These values allow the trigger to delete the corresponding rows in the phones table that have the same cust_id value.

    CREATE TRIGGER delete_cust
    AFTER DELETE ON cust
    REFERENCING OLD AS oldrow
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      DELETE FROM phones
      WHERE phones.cust_id = oldrow.cust_id;
    END#

Figure 18.19 Delete trigger
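A quick way to check that the insert and delete triggers cooperate is to insert one document and inspect both tables before and after deleting the row. The statements below are a sketch under stated assumptions: the document content is made up, the tables of Figure 18.17 and the triggers of Figures 18.18 and 18.19 are assumed to exist, and the CLP is assumed to use # as the statement terminator.

```sql
-- Insert a made-up customer document; only the XML column is supplied.
INSERT INTO cust(info) VALUES (XMLPARSE(DOCUMENT
  '<customerinfo>
     <name>Matt Foreman</name>
     <addr><street>1596 Baseline</street><city>Toronto</city></addr>
     <phone type="work">905-555-4789</phone>
     <phone type="home">416-555-3376</phone>
   </customerinfo>'))#

-- The insert trigger should have populated name and city ...
SELECT cust_id, name, city FROM cust#
-- ... and added one phones row per phone element, with the same cust_id.
SELECT cust_id, type, number FROM phones#

-- Deleting the customer should fire delete_cust, which removes
-- the phones rows that carry the deleted cust_id.
DELETE FROM cust WHERE name = 'Matt Foreman'#
SELECT cust_id, type, number FROM phones#
```

The application never writes to the phones table itself; both tables stay consistent only as long as all changes go through statements on the cust table.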
18.3.3 Update Triggers on XML Columns

To complete our example, let’s examine the update trigger in Figure 18.20. It maintains the relational columns in the cust and phones tables whenever the info column in the cust table is updated. Note that an update of a customer document might have changed, added, or removed one or multiple phone elements. Thus, the only way to reliably update the phones table is to issue a DELETE followed by an INSERT statement. The UPDATE, DELETE, and INSERT statements in this trigger are the same as in the previous triggers.

    CREATE TRIGGER update_cust
    AFTER UPDATE OF info ON cust
    REFERENCING NEW AS newrow
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      UPDATE cust
      SET (name, city) =
          (SELECT X.name, X.city
           FROM cust, XMLTABLE('$INFO/customerinfo'
                  COLUMNS
                    name VARCHAR(30) PATH 'name',
                    city VARCHAR(20) PATH 'addr/city') AS X
           WHERE cust.cust_id = newrow.cust_id )
      WHERE cust.cust_id = newrow.cust_id;

      DELETE FROM phones
      WHERE phones.cust_id = newrow.cust_id;

      INSERT INTO phones(cust_id, type, number)
        SELECT cust.cust_id, P.type, P.number
        FROM cust, XMLTABLE('$INFO/customerinfo/phone'
               COLUMNS
                 type   VARCHAR(5)  PATH '@type',
                 number VARCHAR(15) PATH '.') AS P
        WHERE cust.cust_id = newrow.cust_id;
    END#

Figure 18.20 Update trigger

18.4 SUMMARY

Stored procedures, user-defined functions (UDFs), and triggers are very powerful tools to customize or automate data processing steps for your specific application. DB2 for Linux, UNIX, and Windows allows you to create stored procedures and UDFs with input parameters, output parameters, and variables of type XML. Such procedures and functions can contain XQuery and SQL/XML statements to query and manipulate XML data.

The benefit of using the XML data type for parameters and variables is that DB2 keeps the XML data internally in the pureXML parsed tree format. This format enables stored procedures and UDFs to process XML much more efficiently than a textual XML representation in VARCHAR or CLOB parameters would allow. For example, a UDF can read and manipulate data from an XML column without XML parsing because the data stays in DB2’s internal XML storage format. If an application passes an XML document to a stored procedure via an XML type parameter, the document is parsed only once upon entry into the procedure. Any subsequent processing steps within the procedure do not require XML parsing. Hence, the XML data type support in stored procedures and UDFs is a significant performance benefit for any custom XML processing logic that you implement.

You can also define triggers on tables with XML columns to implement automated actions that are executed when XML documents are inserted, deleted, or updated. In a trigger, transition variables give you access to the relational values of the affected rows, but not to the old or new value of an affected XML column. In the body of a trigger you can use the relational primary key values of the affected rows to find and access the corresponding XML documents in the table and perform any required operation on them.

Stored procedures have been found very useful to encapsulate and hide XML processing from application programs. This reduces application complexity and improves end-to-end performance because SQL/XML statements in DB2 procedures can perform many XML processing tasks more efficiently and with less code than application programs.
Chapter 19 Performing Full-Text Search

XML applications and data can often be classified in one of two ways: predominantly data-centric or predominantly document- or content-centric. For example, the processing of orders, sales, or trades is typically data-centric while the management of contracts, emails, or news articles is document-centric. Content-centric XML documents often contain significant amounts of free-flow text, including full sentences and paragraphs. Such full text is rare in data-centric XML, which tends to contain atomic data values such as names, dates, prices, quantities, or addresses. Therefore, full-text search is more commonly required for querying content-centric XML than data-centric XML documents.

There are also applications that exhibit characteristics of both data- and document-oriented XML processing. In fact, it is a particular strength of XML to serve as a single format for any combination of data and content. For example, plain text comments can be part of an order, or a description can be part of a product detail record. Wherever individual data items consist of more than one word, and whenever you need to search for substring matches, full-text search can be the right solution.

The following topics are discussed in this chapter:

• Overview of full-text search capabilities in DB2 (section 19.1)
• Sample table and documents used in this chapter (section 19.2)
• The DB2 Net Search Extender (sections 19.3 through 19.5)
• DB2 Text Search (section 19.6)
• Summary of text search administration commands (section 19.7)
• Comments on full-text search in DB2 for z/OS (section 19.8)
19.1 OVERVIEW OF TEXT SEARCH IN DB2

DB2 offers two technologies to perform full-text search. Both of them handle plain text, HTML, and XML data, as well as document formats such as PDF and Microsoft Word.

• The DB2 Net Search Extender (NSE) has been providing powerful text search capabilities since DB2 8 for Linux, UNIX, and Windows. The Net Search Extender is XML aware and fully functional with the new XML column type in DB2 9 and higher. The DB2 Net Search Extender continues to provide reliable and mature text search in DB2 with proven scalability and performance.
• DB2 Text Search is new text search functionality that is based on the technology in the open source project Lucene. The same technology is also used in IBM OmniFind Text Search Server for DB2 for z/OS (see section 19.8). DB2 Text Search first became available in DB2 9.5 for Linux, UNIX, and Windows, Fixpack 1. Its features and performance continue to be improved in subsequent releases. DB2 Text Search in DB2 9.5 is just the beginning of integrating OmniFind text search capabilities into DB2 on all platforms.

The DB2 Net Search Extender and DB2 Text Search can coexist in the same database instance, but in a given database you can enable and use only one of them, not both. You will find that many DB2 Text Search features and most of its administration commands are identical or similar to those of the DB2 Net Search Extender.

The DB2 Net Search Extender and DB2 Text Search have several design principles in common:

• A table in which one or multiple columns are indexed for text search must have a primary key. The primary key values of the table are used in the text index to correlate text search results from the text index back to the rows in the table. Consequently, the finest granularity of text search results is a row (a document).
• When a text index is created, triggers and a staging table (also known as a log table) are also automatically created in DB2. Any insert, update, or delete on the indexed table fires a trigger that in turn writes corresponding information about the data changes into the staging table. The content of this staging table is read to update the text index, and is subsequently deleted.
• Text indexes are maintained asynchronously; that is, not in the context of the original insert, update, or delete statements. Updates of the text index are either explicitly invoked with an UPDATE INDEX command, or they happen regularly on a predefined schedule.

Table 19.1 summarizes the most important commonalities and differences between the DB2 Net Search Extender and DB2 Text Search as of DB2 Version 9.5 Fixpack 1.
Table 19.1 Comparing the DB2 Net Search Extender and DB2 Text Search

Feature                                                                      DB2 Net Search Extender  DB2 Text Search
Separate Text Search Install                                                 Yes                      No, part of DB2 install
DPF Support                                                                  Yes (on AIX)             No
Command line interface                                                       Yes                      Yes
Administration also through the DB2 Control Center                           Yes                      No
Administration also through stored procedures                                No                       Yes
DB2 Backup includes text index                                               No                       No
Asynchronous index updates                                                   Yes                      Yes
Synchronous index updates                                                    No                       No
Index updates: manual or scheduled                                           Both                     Both
Document models, to index only a subsection (part) of each XML document      Yes                      No
Multiple text indexes per column                                             Yes                      No
Indexes on views and nicknames                                               Yes                      No
Stop words (avoid indexing irrelevant words, such as "a", "or", and "the")   Yes, optional            No
SQL function: contains                                                       Yes                      Yes
XQuery function: db2-fn:xmlcolumn-contains                                   No                       Yes
Support for XML namespaces                                                   Limited                  No
Can limit the result set size                                                Yes                      Yes
Boolean search (and, or, and not operators for text predicates)              Yes (and: &, or: |)      Yes (and: &&, or: ||)
Wildcards in search predicates                                               Yes                      Yes
Search with escape characters                                                Yes                      Yes
Stemming (reduces search word to its base form)                              Yes, optional            Yes, implicitly
Synonym search (Thesaurus)                                                   Yes                      Yes
Weighted search                                                              Yes                      Yes
Fuzzy search                                                                 Yes                      No
Proximity search                                                             Yes                      No
Ranking/scoring of result set items                                          Yes                      Yes
Case-sensitive search                                                        Yes                      No
Linguistic processing (search for linguistic variations of the search term)  English only             All supported languages

19.2 SAMPLE TABLE AND DATA

In the remainder of this chapter we use the following sample table and data to illustrate the text search capabilities in DB2 (see Figure 19.1). You will see that it does not take magic to perform efficient XML full-text search in DB2.

    CREATE TABLE orders (id INTEGER NOT NULL PRIMARY KEY, doc XML)

Row with id = 1, column doc:

    <order date="2007-11-05">
      <customer>Wendy Witch</customer>
      <item key="82">
        <name>Crystal Ball, Deluxe Edition</name>
        <quantity>5</quantity>
        <price>95.00</price>
        <comment>Customer requested extra wrapping.</comment>
      </item>
      <item key="83">
        <name>Magic Potion, 300ml flask</name>
        <quantity>10</quantity>
        <price>19.95</price>
        <comment>Await further shipping instructions.</comment>
      </item>
    </order>

Row with id = 2, column doc:

    <order date="2007-11-29">
      <customer>William Wizard</customer>
      <item key="55">
        <name>Magician's Hat, Black</name>
        <quantity>1</quantity>
        <price>75.00</price>
        <comment>Must be big enough for the rabbit.</comment>
      </item>
      <item key="56">
        <name>White Rabbit</name>
        <quantity>1</quantity>
        <price>295.00</price>
        <comment>Extra soft fur and extra white.</comment>
      </item>
    </order>

Figure 19.1 Sample table and data
Note that the second document contains a single quote in the name of the first item. This quote is not a problem if you import or load the document, or insert it with a parameter marker. But if you execute an insert statement in the DB2 Command Line Processor (CLP) with a literal XML document in the statement, a single quote in an XML value conflicts with the single quotes that enclose the document string. Hence, the first of the three insert statements in Figure 19.2 fails. You can escape the single quote either by using two single quotes or by using the corresponding entity reference (&apos;).

    --incorrect:
    INSERT INTO orders VALUES(1, '<name>Magician's Hat</name>');
    --correct:
    INSERT INTO orders VALUES(2, '<name>Magician''s Hat</name>');
    INSERT INTO orders VALUES(3, '<name>Magician&apos;s Hat</name>');

Figure 19.2 Inserting XML data with quotes in the CLP

19.3 ENABLING A DATABASE FOR THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender (NSE) requires a separate install in addition to the regular DB2 install. Appendix C, Further Reading, contains links to information about downloading and installing the NSE. After installation you can start and stop the Net Search Extender instance services much like you start and stop a DB2 server. You have to be the DB2 instance owner to issue the following commands at the OS prompt:

    db2text start
    db2text stop [force]

The optional keyword force can be used to forcibly stop the NSE even if there are processes still holding locks or if caching for an index is still activated. Be careful with the use of the force option. If you perform db2text stop force while an index update or reorg is in progress, the text index may get damaged and might have to be rebuilt entirely.

After starting the DB2 Net Search Extender instance services, the first step is to enable a database for text search.
Execute the following command at the OS prompt to enable the database <dbname> for text search:

    db2text ENABLE DATABASE FOR TEXT CONNECT TO <dbname>

As for the majority of the db2text commands, you can optionally provide a user name and password for authentication to the database:

    db2text ENABLE DATABASE FOR TEXT CONNECT TO <dbname> USER <username> USING <password>
The ENABLE DATABASE command creates UDFs, stored procedures, and the following tables and views in the default table space of the database:

• db2ext.dbdefaults: Contains default values for text search configuration parameters
• db2ext.textindexformats: Stores the list of supported index formats and the currently used document models
• db2ext.indexconfiguration: Stores index configuration parameters
• db2ext.textindexes: Keeps track of all text indexes

Similarly, you can disable the DB2 Net Search Extender for a database with the following command, which removes the NSE tables, views, and UDFs, and drops all NSE indexes for that database.

    db2text DISABLE DATABASE FOR TEXT [force] CONNECT TO <dbname> USER <username> USING <password>

19.4 MANAGING FULL-TEXT INDEXES WITH THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender allows you to define one or multiple text indexes per column. It also allows you to index only a certain section of each document instead of indexing all elements and attributes in a document. Such partial indexing leads to fewer index entries per document, smaller text indexes, and better index update and search performance. The following sections illustrate the CREATE INDEX command and its various options for the DB2 Net Search Extender.

19.4.1 Creating Basic Text Indexes

Issued at the OS command prompt, the following command creates a text index with the name orderIdx on the column doc in the table orders in the database <dbname>:

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc) CONNECT TO <dbname> USER <username> USING <password>"

Depending on the operating system and configuration of your command shell, enclosing the command parameter for db2text in double quotes might be necessary, as shown in this example. Specifying a user name and a password for authentication to the database is optional. The table orders must have a primary key; otherwise, a text index cannot be created.
The column doc must be of type XML or any character or binary column type, such as CHAR, VARCHAR, CLOB, BLOB, DBCLOB, GRAPHIC, or VARCHAR FOR BIT DATA. Unlike relational indexes in DB2, the CREATE INDEX statement for a text index defines an index but does not actually build the text index. An UPDATE INDEX command is required after the CREATE INDEX statement to perform the initial index build (see section 19.4.6).
For each text index, the Net Search Extender creates a log table and an event table as well as triggers on the user table. Upon insert, delete, update, or import of data, the triggers fire and write change information into the log table, which is later used to update the index. The event table contains information about index updates and potential problems, such as invalid document formats.

If you use the DB2 LOAD utility to move documents into your table, the triggers don’t fire and incremental indexing of the loaded documents does not happen. Therefore, it is recommended to use the DB2 IMPORT utility, which activates the triggers. If you insist on using LOAD for performance reasons, then it is your own responsibility to fill the log table appropriately before issuing the next UPDATE INDEX command.

The names of the log table and event table are system-generated. DB2 also creates views on these tables to allow easy inspection of the information. Use the SQL statement in Figure 19.3 to obtain the schema and view names for the index called orderIdx.

    SELECT eventviewschema, eventviewname, logviewschema, logviewname
    FROM db2ext.textindexes
    WHERE indname = 'ORDERIDX'

Figure 19.3 Obtaining names of the event and log views for a given text index

19.4.2 Creating Text Indexes with Specific Storage Paths

The previous examples used default locations for the text index and the index building work area. The work area is used to hold temporary files that are created when text indexes are built or updated. The default locations are defined in the table DB2EXT.DBDEFAULTS and are typically in /sqllib/db2ext/indexes. This default location is often not a good place for large text indexes. The command in Figure 19.4 specifies that the index is created in the file system /data/index while temporary NSE files are written to /data/temp.
Additionally, the log and event tables are placed in the table space named nse_tspace instead of the default user table space.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             INDEX DIRECTORY /data/index
             WORK DIRECTORY /data/temp
             ADMINISTRATION TABLES IN nse_tspace
             CONNECT TO <dbname>"

Figure 19.4 Text index with non-default storage locations

The DB2 instance owner needs to have read, write, and execute permissions for the index and the work directory. In a DPF system these directories have to exist on every physical node. For best performance, the index and work directories should be allocated on RAID arrays that allow high I/O throughput.
PERFORMANCE TIP When a text index is created or updated, potentially large amounts of data might have to be moved from the work directory to the index directory. If the index directory and the work directory are located in different file systems, then this move is an expensive copy operation. If the index and work directory are located within the same file system, an inexpensive rename operation can be performed instead of a copy. Hence, for best performance it is highly recommended that the index and work directory share the same file system.

The disk space required for an index depends on the amount and type of data that is being indexed and on the length of the primary key in the user table. Since the primary key is part of the index, short keys such as INTEGER or TIMESTAMP are preferable over long keys, such as CHAR(128). As a rule of thumb you should reserve at least 0.7 times as much space for the text index as the size of the data volume you want to index. The work area can require two to three times as much space as the raw data.

19.4.3 Creating Text Indexes with a Periodic Update Schedule

By default a text index is not updated automatically. You have to use the explicit UPDATE INDEX command whenever you want to refresh the text index, or configure the index for regularly scheduled index updates. The CREATE INDEX statement in Figure 19.5 defines a text index that is automatically refreshed four times a day. The string D(*)H(0,6,12,18)M(30) means that the index is updated every day at 0:30, 6:30, 12:30, and 18:30 hours.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             UPDATE FREQUENCY D(*)H(0,6,12,18)M(30)
             CONNECT TO <dbname>"

Figure 19.5 Text index with automatic periodic updates

Alternatively, the string D(1,2,3,4,5)H(*)M(0,15,30,45) would mean that the index gets updated Monday through Friday every 15 minutes.
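To make the structure of such frequency strings concrete, the following Python sketch parses a D(...)H(...)M(...) string and checks whether a given point in time falls on the schedule. It is only an illustration of how the three groups combine and is not part of the Net Search Extender. The function names are invented for this example, and the day numbering (0 = Sunday through 6 = Saturday) is an assumption that is merely consistent with 1 through 5 meaning Monday through Friday.

```python
import re
from datetime import datetime

# Matches the three groups of an UPDATE FREQUENCY string such as
# D(*)H(0,6,12,18)M(30); blanks between the groups are tolerated.
_PATTERN = re.compile(r"^D\(([^)]*)\)\s*H\(([^)]*)\)\s*M\(([^)]*)\)$")


def parse_group(text, lowest, highest):
    """Return the set of integers named by one group; '*' means all values."""
    text = text.strip()
    if text == "*":
        return set(range(lowest, highest + 1))
    values = {int(part) for part in text.split(",")}
    if not all(lowest <= v <= highest for v in values):
        raise ValueError(f"value outside {lowest}..{highest}: {text}")
    return values


def parse_frequency(spec):
    """Split a D(...)H(...)M(...) string into day, hour, and minute sets."""
    match = _PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"not a D(...)H(...)M(...) string: {spec!r}")
    days, hours, minutes = match.groups()
    return (parse_group(days, 0, 6),     # assumption: 0 = Sunday ... 6 = Saturday
            parse_group(hours, 0, 23),
            parse_group(minutes, 0, 59))


def is_due(spec, moment):
    """True if the schedule names the weekday, hour, and minute of moment."""
    days, hours, minutes = parse_frequency(spec)
    weekday = moment.isoweekday() % 7    # Monday = 1 ... Saturday = 6, Sunday = 0
    return weekday in days and moment.hour in hours and moment.minute in minutes


# November 5, 2007 was a Monday.
print(is_due("D(*)H(0,6,12,18)M(30)", datetime(2007, 11, 5, 6, 30)))           # True
print(is_due("D(1,2,3,4,5)H(*)M(0,15,30,45)", datetime(2007, 11, 10, 9, 45)))  # False (Saturday)
```

A schedule matches only when all three groups match, which is why D(*)H(0,6,12,18)M(30) yields exactly four update times per day.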
You will see later that there is also an ALTER INDEX command in which you can use the UPDATE FREQUENCY clause to define or change automatic updates for existing indexes.

NOTE System load considerations and the time it takes for an index update to finish should be the guiding factors for choosing an appropriate update interval that is not too short. An update interval of one minute is almost always the wrong thing to do.
Depending on your application, you might want to avoid index maintenance at the scheduled times if there was only an insignificant number of changes to your data since the last time the index was updated. Figure 19.6 creates an index that is updated every 30 minutes if there are at least 50 document changes queued up in the log table. If there are fewer than 50 changes in the log table, the index is not updated. After 30 minutes, the scheduler checks again whether 50 or more changes have accumulated.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             UPDATE FREQUENCY D(*)H(*)M(0, 30) UPDATE MINIMUM 50
             CONNECT TO <dbname>"

Figure 19.6 Text index with automatic updates when “enough” new rows are available

Such a combination of UPDATE FREQUENCY and UPDATE MINIMUM allows you to define an index update schedule in which the index is updated more frequently when there are many changes in the base table and less frequently if there are fewer changes. If omitted, the default value for UPDATE MINIMUM is 1.

Instead of updating the index incrementally you can also choose to always re-create the index from scratch. Figure 19.7 defines an index that is re-created entirely every night at 2 a.m.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             UPDATE FREQUENCY D(*)H(2)M(0)
             RECREATE INDEX ON UPDATE
             CONNECT TO <dbname>"

Figure 19.7 Text index with automatic re-create

If you define an index with the RECREATE option, no log table and no triggers are created for this index. Use this option with caution as rebuilding a large text index can take a long time.

Note that the DB2 Control Center allows you to administer the DB2 Net Search Extender and to configure the update behavior of text indexes. When you right-click on a database name you are presented with the option to enable the database for text search.
A right-click on the index folder of a database lets you create regular relational indexes but also text indexes. A multi-step wizard guides you through the text index definition and allows you to change default parameters such as index location and update characteristics. Figure 19.8 illustrates step 4 of the Create Text Index Wizard, where you can set the frequency of automatic updates. The settings selected in Figure 19.8 result in a CREATE INDEX statement with the clause UPDATE FREQUENCY D(1) H(3) M(30).
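The interplay of UPDATE FREQUENCY and UPDATE MINIMUM described for Figure 19.6 can be illustrated with a small Python simulation. It is a simplified model for illustration only, not the scheduler of the Net Search Extender: each list entry stands for the number of document changes that arrive before one scheduled check, and the function name simulate_updates is hypothetical.

```python
def simulate_updates(changes_per_interval, update_minimum=1):
    """Replay scheduled checks; return (indexed, still pending) for each check."""
    pending = 0
    history = []
    for new_changes in changes_per_interval:
        pending += new_changes           # triggers queue the changes in the log table
        if pending >= update_minimum:    # enough changes: run an incremental update
            history.append((pending, 0))
            pending = 0                  # processed log entries are removed
        else:                            # too few changes: skip this time slot
            history.append((0, pending))
    return history


# Five checks, 30 minutes apart, with UPDATE MINIMUM 50.
for indexed, waiting in simulate_updates([20, 20, 20, 5, 100], update_minimum=50):
    print(f"indexed {indexed:3d} changes, {waiting:3d} still waiting")
```

In this run the first two checks are skipped, the third check indexes the 60 accumulated changes, the fourth is skipped again, and the fifth indexes 105 changes. Skipped changes are never lost; they simply wait for a later check.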
Figure 19.8 Create Text Index Wizard in the DB2 Control Center

19.4.4 Creating Text Indexes for Specific Parts of Each Document

When you define a text index on an XML column, the DB2 Net Search Extender creates index entries for all XML elements and attributes in the XML documents in the column. But indexing all parts of the documents is not always necessary. Let’s look at the sample document in Figure 19.1. If you manage many “order” documents of this nature, you might want to perform full-text search on item names and comments. In that case, creating a full-text index on these elements is sufficient and leads to a much smaller index as compared to indexing all elements and attributes. A smaller index often allows better update and search performance. If you also need to perform queries with predicates on short data values (such as order date, customer name, item key, quantity, and price), you should use regular XML indexes.

With the Net Search Extender you can use document models to control which parts of the document structure are and aren’t indexed, and by which name you can refer to these parts in search queries. A document model itself is a small XML document in the file system. This model file is passed as a parameter to the CREATE INDEX command and is read during index creation only. Later changes to the document model do not affect existing indexes.
Figure 19.9 shows a simple document model for documents like the ones in Figure 19.1. This document model declares that only item names and comments are indexed. Every XML document model starts with the element XMLModel, which includes one or multiple XMLFieldDefinition elements. Each XMLFieldDefinition assigns a name to a locator. The locator is a simple XPath expression that defines which elements, attributes, or subtrees to index. The locator can contain XPath wildcards (*), namespace prefixes, the XPath union operator (|), and the XPath descendant-or-self axis, which is also known as the “double slash” (//).

    <?xml version="1.0"?>
    <XMLModel>
      <XMLFieldDefinition name="iName" locator="/order/item/name"/>
      <XMLFieldDefinition name="iComments" locator="/order/item/comment"/>
    </XMLModel>

Figure 19.9 A simple document model

If the document model is stored in the file itemModel.xml, then the following command defines a full-text index for item names and comments:

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             FORMAT XML DOCUMENTMODEL XMLModel IN itemModel.xml
             CONNECT TO <dbname>"

Note that you might have to specify a full file system path to the model file.

The document model in Figure 19.10 declares that all elements under /order/item are indexed, except for the elements quantity and price, which are explicitly excluded. Depending on the actual data in the XML column, and on the existence of other elements under /order/item, this document model can index more information than the previous one in Figure 19.9. However, for the sample documents in Figure 19.1, both document models index exactly the item name and comment. We will later use these document models in text search queries.
    <?xml version="1.0"?>
    <XMLModel>
      <XMLFieldDefinition name="item" locator="/order/item/*"/>
      <XMLFieldDefinition name="excl1" locator="/order/item/quantity" exclude="yes"/>
      <XMLFieldDefinition name="excl2" locator="/order/item/price" exclude="yes"/>
    </XMLModel>

Figure 19.10 A document model with exclude
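The claim that both document models select the same parts of the sample data can be checked with a short Python sketch. It mimics the two models from Figures 19.9 and 19.10 with the standard xml.etree.ElementTree module and applies them to the first sample order. This is an illustration of the include and exclude logic only; it does not reflect how the Net Search Extender evaluates locators internally, it handles only child steps and the * wildcard, and the helper names select and indexed_text are invented for this example.

```python
import xml.etree.ElementTree as ET

ORDER = """<order date="2007-11-05">
  <customer>Wendy Witch</customer>
  <item key="82">
    <name>Crystal Ball, Deluxe Edition</name>
    <quantity>5</quantity>
    <price>95.00</price>
    <comment>Customer requested extra wrapping.</comment>
  </item>
  <item key="83">
    <name>Magic Potion, 300ml flask</name>
    <quantity>10</quantity>
    <price>19.95</price>
    <comment>Await further shipping instructions.</comment>
  </item>
</order>"""

# Each entry: (field name, locator, exclude flag), mirroring XMLFieldDefinition.
MODEL_A = [("iName", "/order/item/name", False),
           ("iComments", "/order/item/comment", False)]
MODEL_B = [("item", "/order/item/*", False),
           ("excl1", "/order/item/quantity", True),
           ("excl2", "/order/item/price", True)]


def select(root, locator):
    """Resolve an absolute child-axis locator such as /order/item/* against root."""
    steps = locator.strip("/").split("/")
    if steps[0] not in ("*", root.tag):
        return []
    if len(steps) == 1:
        return [root]
    return root.findall("./" + "/".join(steps[1:]))


def indexed_text(xml_text, model):
    """Return (field name, text) pairs for all selected, non-excluded elements."""
    root = ET.fromstring(xml_text)
    excluded = {id(element)
                for _, locator, exclude in model if exclude
                for element in select(root, locator)}
    return [(name, element.text)
            for name, locator, exclude in model if not exclude
            for element in select(root, locator)
            if id(element) not in excluded]


texts_a = {text for _, text in indexed_text(ORDER, MODEL_A)}
texts_b = {text for _, text in indexed_text(ORDER, MODEL_B)}
print(texts_a == texts_b)  # True: both models pick the item names and comments
```

The two models differ only in the field names under which the text is registered (iName and iComments versus item), which matters later when a search query refers to a field by name.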
19.4.5 Creating Text Indexes with Advanced Options

The CREATE INDEX statement can use the optional INDEX CONFIGURATION clause to set one or multiple additional configuration parameters for the index. The index definition can also specify a transformation function that is applied to each document before indexing. Let’s look at some examples.

Stop words are words that occur frequently but have little relevance for text search. In the English language, frequent stop words are "a", "or", "in", "the", and so on. By default the DB2 Net Search Extender includes stop words in the text index, which can increase the index size and reduce the precision of the search results. Therefore you might want to ignore stop words. In the CREATE INDEX command in Figure 19.11, the index configuration parameter IndexStopWords 0 advises the Net Search Extender not to index stop words.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             LANGUAGE PT_BR
             UPDATE FREQUENCY D(*)H(2)M(0)
             INDEX CONFIGURATION (IndexStopWords 0, UpdateDelay 30, IgnoreEmptyDocs)
             CONNECT TO <dbname>"

Figure 19.11 Text index with update delay and exclusion of stop words and empty documents

The Net Search Extender installation includes text files with lists of stop words for more than 20 languages. These stop word files cannot (easily) be edited. The LANGUAGE clause in Figure 19.11 specifies that the documents are assumed to be in Brazilian Portuguese, which implies that the stop words to ignore are in the file /sqllib/db2ext/resources/PT_BR.tsw. The use of stop words is language specific and works best if all documents that are indexed are in the same language.

In Figure 19.11, the index configuration parameter UpdateDelay 30 specifies that at the time of incremental update, which is 2 a.m. in this example, only entries older than 30 seconds are taken from the log table. Any change records younger than 30 seconds are processed on the next incremental update.
This deferral avoids lost updates when user transactions that modify the base table overlap with the starting point of the incremental update. Therefore, the UpdateDelay parameter should be set to the maximum expected duration of a user write transaction on the table that the index was created on. The index configuration parameter IgnoreEmptyDocs specifies that rows where the indexed column is empty or NULL are not represented in the index. If you index a column that has a significant percentage of NULL values, this option can reduce the size of the index. Figure 19.12 shows a text index definition that applies a function functionname to the values in the column that is being indexed. This function can be any built-in or user-defined function but it must produce a single value of type XML or any character or binary data type. A transformation
function can be useful to transform, extend, or shorten the data before it is passed to the indexer. Beware that a complex or inefficient transformation can have a drastic impact on the performance of building or updating a text index.

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(functionname(doc))
             CONNECT TO <dbname> USER <username> USING <password>"

Figure 19.12 Text index with transformation function

19.4.6 Updating and Reorganizing Text Indexes

Unlike relational indexes in DB2, the CREATE INDEX statement for a text index only defines an index, but it does not actually build the index structure. The index is built, and subsequently updated, when a scheduled index update takes place or when the explicit UPDATE INDEX command is issued. For example, after defining a new text index with the name orderIdx, you might want to run this command to build the index immediately:

    db2text "UPDATE INDEX orderIdx FOR TEXT CONNECT TO <dbname>"

For large text indexes you may wish to check the progress of the index update process. The CONTROL LIST ALL LOCKS command displays information about the locks currently held for a specific index or database:

    db2text "CONTROL LIST ALL LOCKS FOR DATABASE <dbname> INDEX orderIdx CONNECT TO <dbname>"

If there is an update lock, this command also prints the number of documents that have been processed so far.

The DB2 Net Search Extender continuously monitors the quality of the index structure and determines whether a reorganization of the index is recommended. If periodic index updates are scheduled, they also automatically reorganize the index when needed. These automatic reorganizations relieve the DBA of the decision when to reorganize and ensure decent index performance over time. If you choose to disable automatic index reorganization you need to specify REORGANIZE MANUAL in the CREATE INDEX statement, as in Figure 19.13.
    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             UPDATE FREQUENCY D(*)H(1,13)M(0) UPDATE MINIMUM 100
             REORGANIZE MANUAL
             CONNECT TO <dbname>"

Figure 19.13 A text index without automatic reorganization
Note that the manual reorganization property of a text index can only be set in the CREATE INDEX statement and cannot be altered later. Hence, this option should be used with care. We recommend that you use the default, which is automatic reorganization.

If you decide not to rely on automatic reorganization you should manually reorganize the index from time to time. You can use the following SQL statement to check whether reorganization for a given index is recommended:

    SELECT reorg_suggested FROM db2ext.textindexes WHERE indname = 'ORDERIDX'

The following UPDATE INDEX command forces reorganization of an index explicitly:

    db2text "UPDATE INDEX orderIdx FOR TEXT REORGANIZE CONNECT TO <dbname>"

19.4.7 Altering Text Indexes

The ALTER INDEX command allows you to change the update frequency of a particular text index as well as the location of its index and work directories. Figure 19.14 shows three sample commands. The first command sets the update frequency of the text index such that it is updated every hour if at least 100 text documents have been changed. The second command disables automatically scheduled index updates. The third command moves the index and work directories to a new storage location. The index is locked and cannot be used for queries while the storage areas are being moved.

    db2text "ALTER INDEX orderIdx FOR TEXT
             UPDATE FREQUENCY D(*)H(*)M(0) UPDATE MINIMUM 100
             CONNECT TO <dbname>"

    db2text "ALTER INDEX orderIdx FOR TEXT
             UPDATE FREQUENCY NONE
             CONNECT TO <dbname>"

    db2text "ALTER INDEX orderIdx FOR TEXT
             INDEX DIRECTORY /newstorage/index
             WORK DIRECTORY /newstorage/temp
             CONNECT TO <dbname>"

Figure 19.14 Three ALTER INDEX commands for text indexes
19.5 PERFORMING XML FULL-TEXT SEARCH WITH THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender offers three methods for performing full-text search:

• SQL scalar functions (contains, score, numberofmatches): These functions are seamlessly integrated into SQL and provide the most flexible approach for text search. You can use these functions as you would use any other functions in SQL queries. The DB2 query optimizer estimates the selectivity of CONTAINS predicates and uses this information to generate efficient access plans for SQL queries that include text search.
• Text search table function: A key benefit of the DB2 Net Search Extender table function is that it allows full-text search on views. It returns a set of primary key values from the text index that you need to join with the base table to obtain the actual search results.
• Text search stored procedure: The DB2 Net Search Extender stored procedure can perform high-performance search against a predefined in-memory cache of user data. This type of search cannot be used in arbitrary SQL queries and does not allow automatic update of the text index.

In the remainder of this section we focus on text search with the SQL scalar functions since they are the most flexible and most commonly used search method. They are suitable for the majority of application scenarios and are the only way to perform text search on partitioned tables in a DPF database. Unless otherwise noted, the following examples all use the following simple text index without a custom document model:

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc) CONNECT TO <dbname>"

19.5.1 Full-Text Search in SQL and XQuery

The scalar functions CONTAINS, SCORE, and NUMBEROFMATCHES take two arguments: a column name and a set of search criteria, such as search terms and additional conditions on them.
For each row in the column, the function CONTAINS returns the value 1 if the document in the row matches the search criteria, and 0 otherwise. The query in Figure 19.15 returns all the rows from the orders table where the XML document in the column doc contains the word Deluxe. For the sample data in Figure 19.1, only the first of the two rows is returned.

    SELECT doc
    FROM orders
    WHERE CONTAINS(doc, ' "Deluxe" ') = 1

Figure 19.15 Find all documents that contain the word Deluxe
Note that the search condition, which can be more complex than a single word, is enclosed in single quotes. The search term Deluxe itself is enclosed in double quotes.

The query in Figure 19.16 looks for documents that have two or more occurrences of the word rabbit. The function NUMBEROFMATCHES returns an integer value that indicates how often the search term occurs in a given document. The function SCORE returns a DOUBLE value indicating how well a document meets the search conditions relative to other documents in the same index. Among other things, the score is calculated based on the ratio between the number of matches found in the document and the document’s size. The query in Figure 19.16 returns the documents in order of decreasing relevance. However, only the second of the two rows in Figure 19.1 matches the search condition. Text search is case-insensitive by default, which is why two occurrences of the word rabbit are found in the second document of our sample data.

    SELECT SCORE(doc, ' "rabbit" ') AS score, doc
    FROM orders
    WHERE NUMBEROFMATCHES(doc, ' "rabbit" ') >= 2
    ORDER BY score DESC

Figure 19.16 Find documents that contain the word rabbit at least twice

Since these queries are regular SQL statements they can be arbitrarily extended with relational predicates, joins, aggregation, and other language constructs.

Since the functions CONTAINS, SCORE, and NUMBEROFMATCHES belong to the SQL domain you cannot use them in XQuery directly. Also, the fn:contains function in the XQuery language does not exploit any text indexes. Instead, use the db2-fn:sqlquery function to include text search in an XQuery.
The query in Figure 19.17 returns the customer element from all orders that contain the word Deluxe:

    xquery
    for $i in db2-fn:sqlquery("SELECT doc FROM orders
                               WHERE CONTAINS(doc, ' ""Deluxe"" ') = 1")/order
    return $i/customer

Figure 19.17 Simple text search in XQuery

Since XML data has more structure than regular plain text in VARCHAR or CLOB columns, you can apply the text search condition to specific elements, attributes, or subtrees through the use of XPath. The search condition (still in single quotes) now consists of two parts: a section and a search term. The section defines where to look for the search term. The section is separated from the search term by any amount of whitespace. The XPath can contain only the child axis (/), the attribute axis (@), and namespace prefixes. The query in Figure 19.18 retrieves order documents in which any item name contains the word Deluxe.
    SELECT doc FROM orders
    WHERE CONTAINS(doc, 'SECTION("/order/item/name") "Deluxe" ') = 1

Figure 19.18 Text search along a specific XPath in your XML data

Wildcard search is a technique to match words or phrases in your data that are not exactly the same as the search terms that you provide in the query. Searching with wildcards is very intuitive. You can use the underscore character “_” to match any single character in a word, and the percent sign “%” to match any sequence of multiple characters. Both queries in Figure 19.19 find documents where the word Deluxe appears in an item name. The second of the two queries also matches documents with words such as Delete or Departure in an item name.

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/name") "De_uxe" ')=1;

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/name") "De%e" ')=1;

Figure 19.19 Text search queries with wildcards

19.5.2 Full-Text Search with Boolean Operators

The DB2 Net Search Extender supports the Boolean operators AND (&), OR (|), and NOT. You can use these operators with and without sections in the search conditions. Figure 19.20 shows two equivalent queries that return all documents that contain the word Deluxe or the word Crystal or both in any item name. The second of the two queries is logically equivalent to the first because a list of search terms implicitly uses the OR operator.

    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Deluxe" | "Crystal" ')=1;

    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") ("Deluxe","Crystal") ')=1;

Figure 19.20 Find documents that contain the words Deluxe or Crystal or both
In contrast, the query in Figure 19.21 identifies the documents that contain both Deluxe and Crystal in an item name.

    SELECT doc FROM orders
    WHERE CONTAINS(doc,' SECTION("/order/item/name") "Deluxe" & "Crystal" ') = 1

Figure 19.21 Find documents that contain two search terms in a specific element

The queries in Figure 19.22, which use the Boolean operator AND between two section expressions, have a different meaning than the query in Figure 19.21. While the previous query returns documents where Deluxe and Crystal appear in the same item name, the queries in Figure 19.22 also return documents where Deluxe and Crystal occur in different item names. Remember that a single order can contain multiple items and those items can have different names. The second query in Figure 19.22 is logically equivalent to the first, but it is typically less efficient because it calls the CONTAINS function twice.

    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Deluxe" & SECTION("/order/item/name") "Crystal" ')=1;

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/name") "Deluxe" ')=1
      AND CONTAINS(doc, ' SECTION("/order/item/name") "Crystal" ')=1;

Figure 19.22 Two equivalent queries that are different from Figure 19.21

The Boolean operators also work without using a section in the search condition. Figure 19.23 shows a query that returns all documents that contain words that start with Magic or Crystal but that do not contain the word Potion. Only the second of the two documents in Figure 19.1 is returned. This query uses parentheses to force a certain evaluation order of the Boolean operators: the OR (|) is evaluated before the AND (&). Without the parentheses the AND is evaluated first, because AND has precedence over OR, as in regular Boolean logic.

    SELECT doc FROM orders
    WHERE CONTAINS(doc,'("Magic%" | "Crystal%") & NOT "Potion" ')=1

Figure 19.23 Combining the Boolean operators OR (|), AND (&), and NOT
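To make the effect of the wildcards and of operator precedence concrete, the following Python sketch mimics the search condition of Figure 19.23 on a made-up string. It is only an illustration of the semantics described above and not how the Net Search Extender is implemented. The function names and the sample text are hypothetical, and the sketch assumes that “%” may also match an empty sequence of characters.

```python
import re


def term_matches(term: str, text: str) -> bool:
    """Return True if any word in text matches an NSE-style wildcard term.

    Assumption: '_' stands for exactly one character, '%' for any run of
    characters (possibly empty), and matching is case-insensitive.
    """
    pattern = "".join(
        "." if ch == "_" else ".*" if ch == "%" else re.escape(ch)
        for ch in term
    )
    words = re.findall(r"\w+", text)
    return any(re.fullmatch(pattern, word, re.IGNORECASE) for word in words)


def with_parentheses(text: str) -> bool:
    # ("Magic%" | "Crystal%") & NOT "Potion"
    return (term_matches("Magic%", text) or term_matches("Crystal%", text)) \
        and not term_matches("Potion", text)


def without_parentheses(text: str) -> bool:
    # "Magic%" | "Crystal%" & NOT "Potion"   (AND binds tighter than OR)
    return term_matches("Magic%", text) \
        or term_matches("Crystal%", text) and not term_matches("Potion", text)


sample = "Magic Potion, 250 ml"  # hypothetical text, not taken from Figure 19.1
print(with_parentheses(sample))     # False: the word Potion disqualifies the text
print(without_parentheses(sample))  # True: "Magic%" alone satisfies the OR
```

Python's `and` and `or` follow the same precedence as the AND and OR operators described above, which is why dropping the parentheses changes the result for this sample text.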
19.5.3 Full-Text Search with Custom Document Models

If you always restrict the text search to item names and item comments in the documents, then the use of a custom document model can help reduce the size of the index and improve performance. Let’s revisit the document model and index definition from section 19.4.4, which are repeated here for convenience in Figure 19.24 and Figure 19.25, respectively.

    <?xml version="1.0"?>
    <XMLModel>
      <XMLFieldDefinition name="iName" locator="/order/item/name"/>
      <XMLFieldDefinition name="iComments" locator="/order/item/comment"/>
    </XMLModel>

Figure 19.24 Document model, stored in the file /models/itemModel.xml

    db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
             FORMAT XML DOCUMENTMODEL XMLModel IN /models/itemModel.xml
             CONNECT TO <dbname>"

Figure 19.25 Text index based on the document model in Figure 19.24

With this document model in the index definition you can simply use the name of an XMLFieldDefinition to identify the section of the document in which you want to search. For example, the first query in Figure 19.26 returns documents that contain the word Deluxe in the item name without specifying the actual XPath to the item name element. The second query returns documents where the word Deluxe appears in the item name or in the item comment. The section names must match the names in the document model exactly, which is why the second query refers to iComments.

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("iName") "Deluxe" ') = 1;

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("iName") "Deluxe" | SECTION("iComments") "Deluxe" ') = 1;

Figure 19.26 Focused search queries that reference sections of a document by name

If you always want to search across item name and comment without distinguishing between the two, then it is even simpler to use the document model with the XPath union operator (|) in Figure 19.27. Then you can run the simplified query in Figure 19.28 to search across item name and comment.
    <?xml version="1.0"?>
    <XMLModel>
      <XMLFieldDefinition name = "NameComment" locator = "/order/item/(name|comment)" />
    </XMLModel>

Figure 19.27 Document model with the XPath union operator (|)

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("NameComment") "Deluxe" ') = 1

Figure 19.28 Query that uses the document model in Figure 19.27

19.5.4 Advanced Search with Proximity, Fuzzy, and Stemming Options

Proximity search allows you to search for words that appear within the same sentence in a larger piece of text in an XML text node. The query in Figure 19.29 retrieves documents where the word rabbit occurs in the same sentence as the phrase big enough. It does not matter whether rabbit occurs before or after big enough.

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/comment") "rabbit" IN SAME SENTENCE AS "big enough" ') = 1

Figure 19.29 Proximity search

Stemming means that the DB2 Net Search Extender looks for words that have the same stem (or base form) as the search term. When the query in Figure 19.30 is executed, the search term wrapped is first reduced to its stem wrap. Then the query returns documents that contain words such as wrap, wrapping, wraps, and so on in an item comment. Stemming is language dependent and currently only supported for English.

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/comment") STEMMED FORM OF "wrapped" ') = 1

Figure 19.30 Text search with stemming

Fuzzy search allows you to look for words that are spelled similarly to the search term that you provide. Specify a value between 1 and 100 to indicate the desired degree of similarity, where 100 requires an exact match and anything below 100 is increasingly fuzzy, which helps you overcome spelling mistakes in the data or search terms. The query in Figure 19.31 retrieves documents
where the word wrapping occurs in an item comment although the search term wraping is spelled with a single p.

    SELECT doc FROM orders
    WHERE CONTAINS(doc, ' SECTION("/order/item/comment") FUZZY FORM OF 85 "wraping" ') = 1

Figure 19.31 Fuzzy search

19.5.5 Finding the Correct Match within an XML Document

All the sample queries discussed so far have retrieved a full document or row when a match was found in the document. But, since there can be multiple items per order, you might want to read only those items from an order document that actually match the text search condition. For example, the query in Figure 19.32 is not suitable to retrieve the items that contain the word Crystal in their name. The WHERE clause qualifies an entire row, and the XMLQUERY function in the SELECT clause extracts all items from the matching document. Hence, this query returns two items from the sample data in Figure 19.1, the Crystal Ball and the Magic Potion. Both belong to the same document but only the Crystal Ball matches the search condition.

    SELECT XMLQUERY('$DOC/order/item')
    FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Crystal" ')=1

Figure 19.32 This query does not find the correct match within a document.

To extract the desired item only, the XMLQUERY function must include a filtering condition based on the XQuery function fn:contains. The query in Figure 19.33 uses the Net Search Extender predicate CONTAINS in the WHERE clause to access the index and to find matching documents. The XQuery function fn:contains in the SELECT clause does not use the text index and only searches within any document that matches the WHERE clause.
    SELECT XMLQUERY('$DOC/order/item[fn:contains(name, "Crystal")]')
    FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Crystal" ')=1

Figure 19.33 This query finds the correct match within a document

If the number of matching documents is moderate, then the overhead of the additional fn:contains predicate in the XMLQUERY function is relatively small, because it is applied only to those documents that already match the WHERE clause. The XQuery function fn:contains does not support fuzzy search, proximity search, stemming, or other advanced search options. Hence, such techniques cannot be used to find a specific item within a document.
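The difference between qualifying a whole document and filtering the items inside it can be reproduced outside of DB2. The following Python sketch uses the standard library's ElementTree on a hypothetical order document that is modeled on the description above (one Crystal Ball item and one Magic Potion item). The function names are made up for this illustration, and the substring test only approximates fn:contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document, modeled on the sample data described in the text.
ORDER = """
<order>
  <customer>Wendy Witch</customer>
  <item><name>Crystal Ball, Deluxe Edition</name></item>
  <item><name>Magic Potion</name></item>
</order>
"""


def item_names(doc: str) -> list:
    """Return the name of every item in the order document."""
    root = ET.fromstring(doc)
    return [item.findtext("name", "") for item in root.iter("item")]


def all_items_if_document_matches(doc: str, word: str) -> list:
    """Document-level qualification, comparable to Figure 19.32."""
    names = item_names(doc)
    return names if any(word in name for name in names) else []


def matching_items_only(doc: str, word: str) -> list:
    """Item-level filter, comparable to the fn:contains predicate in Figure 19.33."""
    return [name for name in item_names(doc) if word in name]


print(all_items_if_document_matches(ORDER, "Crystal"))
# ['Crystal Ball, Deluxe Edition', 'Magic Potion']
print(matching_items_only(ORDER, "Crystal"))
# ['Crystal Ball, Deluxe Edition']
```

The first function returns every item as soon as one of them matches, which is exactly the unwanted behavior of Figure 19.32. The second function applies the condition to each item individually.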
19.5.6 Search Conditions on Sibling Branches of an XML Document

Assume you want to find order documents that contain the string Magic in an item name and the string fur in the comment of the same item. In other words, you have two text search conditions and a positional relationship between them. The conditions are required to match elements (or branches) that are rooted under a common parent, which is the item element in this example. The query in Figure 19.34 comes to mind but it does not represent the desired search semantics.

    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Magic%" & SECTION("/order/item/comment") "fur%" ')=1

Figure 19.34 Query with predicates that can match different item elements

This query returns both documents from the sample table in Figure 19.1 although only the first document has an item with the string Magic in the item name and the string fur in the same item’s comment. The problem with this query is that the sections /order/item/name and /order/item/comment can match different items in the same document. The second document in Figure 19.1 contains one item whose name starts with Magic and a different item where the string fur occurs in the comment. This document is a valid match for the query in Figure 19.34 but not what we wanted.

The DB2 Net Search Extender by itself is not sufficient to fully express the desired query. You need to add additional XPath predicates with XMLEXISTS to properly constrain the name and the comment of the same item, as in Figure 19.35. The additional fn:contains predicates within XMLEXISTS do not use any index but they narrow down the intermediate result set from the Net Search Extender text predicates and index.
    SELECT doc FROM orders
    WHERE CONTAINS(doc, 'SECTION("/order/item/name") "Magic%" & SECTION("/order/item/comment") "fur%" ')=1
      AND XMLEXISTS('$DOC/order/item[fn:contains(name,"Magic") and fn:contains(comment,"fur")]')

Figure 19.35 Extra predicates force both predicates to match the same item element

19.5.7 Text Search in the Presence of Namespaces

The DB2 Net Search Extender has simple support for namespaces. It does not allow you to declare namespaces in the CONTAINS predicates, but you can use namespace prefixes in XPath expressions when you specify a document section. Let’s consider the sample data in Figure 19.36 where the first document has a namespace with an explicit prefix ns, the second document uses a default namespace, and the third document has no namespace at all.
id  doc

1   <ns:order xmlns:ns="http://www.witchcraft.org" date="2007-11-05">
      <ns:customer>Wendy Witch</ns:customer>
      <ns:item key="82">
        <ns:name>Crystal Ball, Deluxe Edition</ns:name>
        <ns:quantity>1</ns:quantity>
        <ns:price>95.00</ns:price>
        <ns:comment>Customer needs extra wrapping.</ns:comment>
      </ns:item>
    </ns:order>

2   <order xmlns="http://www.witchcraft.org" date="2007-11-05">
      <customer>Wendy Witch</customer>
      <item key="82">
        <name>Crystal Ball, Deluxe Edition</name>
        <quantity>1</quantity>
        <price>95.00</price>
        <comment>Customer needs extra wrapping.</comment>
      </item>
    </order>

3   <order date="2007-11-05">
      <customer>Wendy Witch</customer>
      <item key="82">
        <name>Crystal Ball, Deluxe Edition</name>
        <quantity>1</quantity>
        <price>95.00</price>
        <comment>Customer needs extra wrapping.</comment>
      </item>
    </order>

Figure 19.36 Sample data with namespaces

To illustrate text search with namespaces, let’s look for documents where the word Crystal appears in an item name. All three documents in Figure 19.36 contain this word, but they differ in their namespaces. Now consider the two queries in Figure 19.37. The first query uses exactly the same namespace prefix as the first document in Figure 19.36. Hence, only this document (with ID 1) is returned. The second query does not use namespace prefixes in the section specification and returns the second and the third document from Figure 19.36.

The Net Search Extender does not distinguish between a default namespace and no namespace. A document with a default namespace is treated as if there was no namespace. Consequently, the Net Search Extender does not recognize that the first two documents in Figure 19.36 are equivalent in terms of their namespaces. The Net Search Extender merely treats namespace prefixes as if they were parts of the local element names and ignores namespace declarations and URIs.
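The two ways of looking at element names can be contrasted in a few lines of Python. The sketch below parses abbreviated versions of the three documents in Figure 19.36 with ElementTree, which is namespace-aware, and compares the result with a purely literal reading of the start tag. The literal reading is only our approximation of the behavior described above, not the actual logic of the Net Search Extender, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated versions of the three documents in Figure 19.36.
DOCS = {
    1: '<ns:order xmlns:ns="http://www.witchcraft.org">'
       '<ns:item><ns:name>Crystal Ball, Deluxe Edition</ns:name></ns:item>'
       '</ns:order>',
    2: '<order xmlns="http://www.witchcraft.org">'
       '<item><name>Crystal Ball, Deluxe Edition</name></item>'
       '</order>',
    3: '<order><item><name>Crystal Ball, Deluxe Edition</name></item></order>',
}


def expanded_root_name(doc: str) -> str:
    """Namespace-aware view: ElementTree reports '{namespace-URI}local-name'."""
    return ET.fromstring(doc).tag


def literal_root_name(doc: str) -> str:
    """Prefix-literal view: the element name exactly as written in the start tag."""
    return re.match(r"<([^\s>/]+)", doc).group(1)


for doc_id, doc in DOCS.items():
    print(doc_id, expanded_root_name(doc), literal_root_name(doc))
# 1 {http://www.witchcraft.org}order ns:order
# 2 {http://www.witchcraft.org}order order
# 3 order order
```

The namespace-aware view groups documents 1 and 2 together, whereas the literal view groups documents 2 and 3, which corresponds to the query results described above.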
    -- returns only the first document:
    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/ns:order/ns:item/ns:name") "Crystal" ')=1;

    -- returns the second and third document:
    SELECT doc FROM orders
    WHERE CONTAINS(doc,'SECTION("/order/item/name") "Crystal" ')=1;

Figure 19.37 Two queries against the sample data with namespaces

19.6 DB2 Text Search

DB2 Text Search is new text search functionality based on the technology in the open source project Lucene, which is also used in OmniFind. The DB2 Net Search Extender and DB2 Text Search can coexist in the same database instance, but only one of them can be used for a given database. You will find that many DB2 Text Search features and most of its administration commands are identical or very similar to those of the DB2 Net Search Extender. Therefore we do not repeat all of these commands and features but focus predominantly on those that are different in DB2 Text Search. Section 19.7 provides a summary and comparison of the most relevant administrative commands for DB2 Text Search and the DB2 Net Search Extender.

19.6.1 Enabling a Database for DB2 Text Search

The first step to using DB2 Text Search is to start the DB2 Text Search instance service. You must be the DB2 instance owner to run the following command at the OS prompt:

    db2ts START FOR TEXT

Next you enable a database for text search operations. Issue the following command to enable the database <dbname>:

    db2ts ENABLE DATABASE FOR TEXT CONNECT TO <dbname>

This command is identical to the corresponding command for the DB2 Net Search Extender, except that the executable db2ts is used instead of db2text. Optionally you can provide a username and password for authentication to the database:

    db2ts ENABLE DATABASE FOR TEXT CONNECT TO <dbname> USER <username> USING <password>
The ENABLE DATABASE command creates the following tables and views in the default table space of the database:

• SYSIBMTS.TSDEFAULTS: Contains default values for text search configuration parameters
• SYSIBMTS.TSINDEXES: Contains one row with meta information for each text index
• SYSIBMTS.TSLOCKS: Contains dynamic information about database and index locks
• SYSIBMTS.TSCONFIGURATION: Contains index-level configuration parameters
• SYSIBMTS.TSCOLLECTIONNAMES: Associates text index names to internal collection names

Similarly, you can disable DB2 Text Search for a database, which removes these tables and views, with the following command:

    db2ts DISABLE DATABASE FOR TEXT [FORCE] CONNECT TO <dbname> [USER <username> USING <password>]

This command removes all tables and views in the SYSIBMTS schema. Use the optional keyword FORCE to forcibly drop text indexes from the database. If this option is not specified and text indexes still exist for this database, the command fails. You can also use the separate DROP INDEX command to remove text indexes.

19.6.2 Creating and Maintaining Full-Text Indexes for DB2 Text Search

DB2 Text Search allows you to define at most one index per column. The index contains entries for the entire document in the column, which is why you never need more than one text index per column. You cannot use a document model to index only certain sections of each document.

The following command illustrates the CREATE INDEX command for DB2 Text Search. It creates a text index with the name orderIdx on the column doc in the table orders in the database <dbname>:

    db2ts "CREATE INDEX orderIdx FOR TEXT ON orders(doc) CONNECT TO <dbname>"

Depending on the operating system and configuration of your command shell, you might have to enclose the command parameter for db2ts in double quotes, as shown in this example. The table orders must have a primary key.
The column doc must be of type XML or any character or binary column type, such as CHAR, VARCHAR, CLOB, BLOB, DBCLOB, GRAPHIC, or VARCHAR FOR BIT DATA.
The preceding CREATE INDEX statement uses the same syntax as in the DB2 Net Search Extender, except that the executable is db2ts instead of db2text. As for the DB2 Net Search Extender, the CREATE INDEX statement for a text index defines an index but does not actually build the text index. An UPDATE INDEX command is required after the CREATE INDEX statement to physically populate the index. Additional options in the CREATE INDEX syntax, such as defining an update frequency or update minimum, are also the same for DB2 Text Search and the DB2 Net Search Extender. Other administrative commands, such as UPDATE INDEX, DROP INDEX, and so on, are also common to DB2 Text Search and the DB2 Net Search Extender.

19.6.3 Writing DB2 Text Search Queries for XML Data

DB2 Text Search allows you to write search queries with CONTAINS predicates much like the DB2 Net Search Extender. Additionally, DB2 Text Search supports the function db2-fn:xmlcolumn-contains, which enables you to perform full-text search in XQuery without any use of SQL. The query in Figure 19.38 returns the customer information from all orders that contain the word Deluxe anywhere in the document. The syntax of this query is the same as for the DB2 Net Search Extender.

    SELECT XMLQUERY('$DOC/order/customer')
    FROM orders
    WHERE CONTAINS(doc, ' "Deluxe" ') = 1

Figure 19.38 SQL-based text search with the CONTAINS function

Applications that use XQuery only and prefer to avoid using SQL can write the same query with the function db2-fn:xmlcolumn-contains, which is an extension of the regular db2-fn:xmlcolumn function. In addition to the XML column name it takes a search argument as a second input parameter, as in Figure 19.39.
    xquery
    for $i in db2-fn:xmlcolumn-contains('ORDERS.DOC', ' "Deluxe" ')
    return $i/order/customer

Figure 19.39 XQuery-based text search with the db2-fn:xmlcolumn-contains function

While db2-fn:xmlcolumn('ORDERS.DOC') returns all XML documents from the column DOC in the ORDERS table, the db2-fn:xmlcolumn-contains function returns in this example only those documents that contain the word Deluxe.
19.6.4 Full-Text Search with XPath Expressions

DB2 Text Search differs from the DB2 Net Search Extender in the syntax for searching within a specific path of your XML documents. Figure 19.40 shows two equivalent queries and compares the respective syntax. Both queries return all order documents that have an item with a name that contains the word Deluxe.

    -- DB2 Net Search Extender query:
    SELECT doc FROM orders
    WHERE CONTAINS(doc, 'SECTION("/order/item/name") "Deluxe" ') = 1;

    -- DB2 Text Search query:
    SELECT doc FROM orders
    WHERE CONTAINS(doc, '@xpath:''/order/item/name[. contains("Deluxe")]'' ')=1;

Figure 19.40 Comparison of DB2 Text Search and NSE query syntax

Note that the syntax for DB2 Text Search includes the contains function also within the square brackets of the XPath predicate. The dot that precedes the keyword contains is the current XPath context and refers to the name element. Because the XPath expression is enclosed in single quotes inside an SQL string literal, each of these single quotes is written twice. Alternatively you can include the name element itself in the square brackets, as shown in Figure 19.41, which produces the same result.

    SELECT XMLQUERY('$DOC/order/customer')
    FROM orders
    WHERE CONTAINS(doc, '@xpath:''/order/item[name contains("Deluxe")]'' ')=1;

Figure 19.41 DB2 Text Search query with XPath

The query in Figure 19.41 also uses the XMLQUERY function in the SELECT clause to extract and return only the customer element from the matching documents. The same can be coded in XQuery notation, as shown in Figure 19.42.

    xquery
    for $i in db2-fn:xmlcolumn-contains('ORDERS.DOC', '@xpath:''/order/item[name contains("Deluxe")]'' ')
    return $i/order/customer;

Figure 19.42 Text search in XQuery

The XPath navigation in the search argument can contain the child axis (/), the attribute axis (@), the descendant-or-self axis (//), a reference to the current context node (.), and of course element and attribute names. Wildcards (*) are not allowed.
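The doubled single quotes in Figure 19.40 are easy to get wrong when an application assembles the statement text itself. The following Python sketch shows one way to build the search argument first and to apply the SQL quoting rule in a single place. Both helper functions are hypothetical and exist only for this illustration. The sketch produces statement text and does not connect to DB2.

```python
def sql_string_literal(value: str) -> str:
    """Quote a value as an SQL string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def xpath_search_argument(xpath_with_predicate: str) -> str:
    """Wrap an XPath expression in the @xpath:'...' form used in Figure 19.40."""
    return f"@xpath:'{xpath_with_predicate}'"


argument = xpath_search_argument('/order/item/name[. contains("Deluxe")]')
statement = (
    "SELECT doc FROM orders "
    f"WHERE CONTAINS(doc, {sql_string_literal(argument)}) = 1"
)
print(statement)
# SELECT doc FROM orders WHERE CONTAINS(doc, '@xpath:''/order/item/name[. contains("Deluxe")]''') = 1
```

The generated statement differs from Figure 19.40 only in the blank before the closing quote of the SQL literal, which the figure uses to keep the three consecutive quote characters apart for readability.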
19.6.5 Full-Text Search with Wildcards

While the DB2 Net Search Extender uses the underscore (_) to denote a single character wildcard and the percent sign (%) as a multicharacter wildcard, DB2 Text Search uses the question mark (?) and the star (*) instead. The two queries in Figure 19.43 illustrate the use of these wildcards. The first query returns documents that contain words such as Deluxe with an arbitrary character at the third position of the word. The second query also matches documents that contain words such as Delete or Departure.

    SELECT doc FROM orders WHERE CONTAINS(doc,' "De?uxe" ')=1;

    SELECT doc FROM orders WHERE CONTAINS(doc,' "De*e" ')=1;

Figure 19.43 DB2 Text Search queries with wildcards

Note that wildcards at the beginning of a search term, such as ?eluxe or *eluxe, often reduce query performance and should be avoided.

NOTE
DB2 Text Search offers a variety of additional search capabilities that are not specific to XML but applicable to text search in general. For example, you can add the + and - signs to specific search terms to indicate that they are required or prohibited in the search results. If a query contains multiple search terms, DB2 Text Search also allows you to add boost modifiers to some of them to give them stronger weighting in the search. Refer to the DB2 documentation for more details.

19.7 Summary of Text Search Administration Commands

Table 19.2 summarizes the most important commands for administering the DB2 Net Search Extender and DB2 Text Search. The commands are grouped according to their scope, such as the entire DB2 instance, a single database, or a specific table or index.
Table 19.2 Summary of Text Index Administration Commands

NSE / db2ts Command                                      Comment

Instance Level Commands

db2text START                                            Start text search service
db2ts START FOR TEXT
db2text STOP                                             Stop text search service
db2ts STOP FOR TEXT
db2ts CLEANUP FOR TEXT                                   Removes obsolete text search objects

Database Level Commands

db2ts / db2text ENABLE DATABASE FOR TEXT                 Enable full-text search
  CONNECT TO <dbname>
db2ts / db2text DISABLE DATABASE FOR TEXT                Disable full-text search
  CONNECT TO <dbname>
db2text CONTROL {clear|list} ALL LOCKS FOR               List or remove text index locks
  {DATABASE <dbname> | INDEX <indname>}
db2ts CLEAR COMMAND LOCKS [for index <indname>]          Force the removal of text index locks
  FOR TEXT

Table/Index Level Commands

db2ts / db2text CLEAR EVENTS FOR INDEX <indname>         Delete events from event tables
  FOR TEXT
db2ts / db2text CREATE INDEX <indname> FOR TEXT          Add a text index
  ON <tabname> (<colname>)
db2ts / db2text DROP INDEX <indname> FOR TEXT            Remove a text index
db2ts / db2text ALTER INDEX <indname> FOR TEXT…          Change index update options
db2ts / db2text UPDATE INDEX <indname> FOR TEXT          Manually invoke index update
db2text UPDATE INDEX <indname> FOR TEXT REORGANIZE       Reorganize the text index
19.8 XML Full-Text Search in DB2 for z/OS

To perform XML full-text search in DB2 for z/OS, use the IBM OmniFind Text Search Server for DB2 for z/OS. This text search functionality is very similar to DB2 Text Search for DB2 on Linux, UNIX, and Windows, which is also based on OmniFind (see section 19.6). The XML structure in the XML data is indexed in the IBM OmniFind Text Search Server for DB2 for z/OS after parsing the data through an XML parser. Then you can use the CONTAINS function and the supported XPath query syntax to perform XML full-text search. For example, the DB2 Text Search queries in Figure 19.38, Figure 19.40, Figure 19.41, and Figure 19.43 are also supported by the IBM OmniFind Text Search Server for DB2 for z/OS. The functionality and syntax of search conditions are identical.

Differences exist in the installation and administration of the Text Search Server. For example, creating and updating full-text indexes for DB2 for z/OS is performed via stored procedures such as SYSPROC.SYSTS_CREATE and SYSPROC.SYSTS_UPDATE. Links to further information are provided in Appendix C.

19.9 Summary

Unlike data-oriented XML, content-oriented XML documents represent predominantly textual information, such as emails, news, contracts, patents, or web content. Processing such XML data often requires full-text search. The existing DB2 Net Search Extender as well as the new DB2 Text Search capabilities allow you to create full-text indexes over XML documents or other text documents that are stored in XML columns or LOB and character columns.

When you insert, update, or delete an XML document, text indexes are not updated immediately. Instead, the data changes are captured in a staging table and periodically used to refresh the text index. This is known as asynchronous index maintenance. You can configure how frequently a text index gets refreshed.
When a text index is in place you can use text search predicates in SQL, XQuery, and SQL/XML queries. The search predicates can range from simple search terms to complex predicates with Boolean operators, fuzzy search, stemming, linguistic search, proximity search, and other options. A single query can contain text search predicates as well as regular relational and XML predicates at the same time.
Chapter 20 Understanding XML Data Encoding

In this chapter you will learn

• How DB2 identifies and handles the encoding of your XML data
• Best practices to avoid encoding problems

DB2 for z/OS and DB2 for Linux, UNIX, and Windows always store XML data in UTF-8 Unicode, irrespective of the original code page of your XML data or application and regardless of the database code page. If your XML data is in a different encoding, DB2 will automatically perform code page conversion to UTF-8 before storing it. When you retrieve XML data from DB2 you can obtain it in UTF-8 or automatically in the code page of your application.

In most cases all code page conversion is fully transparent to your application, especially if you stick to using Unicode for your data, application, and database code page. If you use non-Unicode code pages, there are certain situations in which code page conversion occurs and can lead to data loss. Data loss can occur when characters in one code page cannot be represented in another.

There is a significant trend towards Unicode in the industry. For example, the application code page of Java and .NET applications is always Unicode (UTF-16).

If you want to use the pureXML capabilities in DB2 9.1 for Linux, UNIX, and Windows, the database code page must be UTF-8. To create a UTF-8 Unicode database with the name demo in DB2 9.1, issue the following command:

    CREATE DATABASE demo USING CODESET utf-8 TERRITORY us

Since DB2 9.5 for Linux, UNIX, and Windows, the default code page for a new database is always UTF-8, so you can safely omit the USING CODESET clause. DB2 9.5 also allows you to use pureXML in non-Unicode databases.
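The kind of data loss that code page conversion can cause is easy to reproduce outside of DB2. The following Python sketch converts a short sample string to UTF-8 and to ISO-8859-1, a code page that has no euro sign. It uses Python's own codecs and its replacement character “?”, so it only illustrates the principle. The substitution behavior of DB2 may differ, and the sample string and function name are made up.

```python
SAMPLE = "Crème brûlée: 5 €"  # hypothetical text; the euro sign is not in ISO-8859-1


def round_trip(text: str, code_page: str) -> str:
    """Convert text to the given code page and back, replacing unmappable characters."""
    converted = text.encode(code_page, errors="replace")
    return converted.decode(code_page)


print(round_trip(SAMPLE, "utf-8"))       # Crème brûlée: 5 €
print(round_trip(SAMPLE, "iso-8859-1"))  # Crème brûlée: 5 ?
```

The conversion to UTF-8 is lossless, whereas the conversion to ISO-8859-1 silently turns the euro sign into a question mark. Once this has happened the original character cannot be recovered from the converted data.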
DB2 for z/OS allows you to define and use XML columns regardless of the encoding scheme of the database or table space. For example, you can add an XML column to an EBCDIC, ASCII, or UNICODE table. The encoding scheme of the table space does not affect the storage of the XML documents, which are always stored in UTF-8. All other data of the same table remains EBCDIC, ASCII, or UNICODE.

Many XML encoding considerations are identical for DB2 for z/OS and DB2 for Linux, UNIX, and Windows. In the remainder of this chapter we refer to “DB2” without indication of platform unless there is platform-specific behavior or syntax that deserves attention.

WHAT IS UNICODE?

Unicode is a universal character encoding standard for the representation of text in computer systems. Unicode assigns a universally unique number to every character in every language, independent from any hardware, operating system, software, or programming language. Before the advent of Unicode, hundreds of different and often conflicting encoding schemes were used. Unicode defines a single consistent encoding for all of the world’s characters, including all European alphabets such as Greek and Cyrillic, Middle-Eastern right-to-left scripts, and all commonly used Asian alphabets. Unicode also covers accented characters (such as é and ñ), punctuation marks as well as mathematical and technical symbols, and hieroglyphs.

WHAT IS UTF-8?

UTF-8, UTF-16, and UTF-32 are the three Unicode Transformation Formats defined by the Unicode standard. They are different ways to represent the same Unicode character codes in bits and bytes. As indicated by their names, they use 8-bit, 16-bit, or 32-bit units. UTF-32 represents every Unicode character as a single 32-bit code unit. UTF-16 is a variable-length encoding, which represents the most commonly used characters in a single 16-bit code unit, and all others in a pair of 16-bit code units. This approach saves space over UTF-32.
Finally, UTF-8 encodes any Unicode character in 1 to 4 bytes. The number of bytes used depends on the character. For example, UTF-8 represents the widely used ASCII characters in just one byte (8 bits) each, which saves space over UTF-16. UTF-8 is the default encoding for XML. The predecessor of UTF-16 is UCS-2, which is a fixed-length encoding that represents each character in 2 bytes. UCS-2 is a subset of UTF-16 and the default encoding in Java applications prior to Java 1.5 (J2SE 5.0), where support for UTF-16 was introduced.
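The space trade-offs between the three transformation formats can be checked with a few lines of Python. The sketch below prints the number of bytes that four sample characters occupy in each format. The choice of characters and the function name are ours, and the little-endian codec variants are used so that Python does not add a byte-order mark to the result.

```python
# Sample characters: ASCII letter, accented letter, euro sign, and a
# character outside the Basic Multilingual Plane (musical G clef, U+1D11E).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001D11E"]


def encoded_lengths(ch: str) -> tuple:
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    return tuple(
        len(ch.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")
    )


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1D11E (4, 4, 4)
```

The last character needs a pair of 16-bit code units in UTF-16, which is the case that the fixed-length UCS-2 encoding cannot represent.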
20.1 Understanding Internal and External XML Encoding

XML differs from other types of data because it can be internally encoded, externally encoded, or both. This difference matters when you move XML data from your application to DB2 (INSERT, UPDATE, IMPORT, or LOAD) and when you retrieve XML data from DB2 (using SELECT or EXPORT). Internally encoded means that the encoding of your XML data can be derived from the data itself, as defined in the XML standard. In contrast, externally encoded means that the encoding of the data is derived from the application code page.

When an XML document is inserted, DB2 identifies its encoding. Whether DB2 treats your XML data as internally encoded or externally encoded depends on the data type of the application variables or parameter markers that you use to exchange XML data with DB2. If your application uses character type variables for XML then the data is considered externally encoded; that is, in the application code page. If you use binary application data types then the XML data is considered internally encoded and the application code page is irrelevant.

DB2 for Linux, UNIX, and Windows detects the application code page automatically from the operating system. For example, in Linux and UNIX it detects the locale setting. You can use the DB2 registry variable DB2CODEPAGE to override the automatically detected application code page and specify a different application code page. Overriding the automatically detected application code page is rarely required and can cause unpredictable results if you set an inappropriate code page. Therefore we recommend that you do not set the DB2CODEPAGE registry variable unless you have a strong reason for it.

20.1.1 Internally Encoded XML Data

The XML standard defines that the internal encoding of XML data is determined based on three items, which may or may not exist in a given XML document.
Any XML parser, including the one in DB2, determines the internal encoding based on these items. They are:

• A Unicode Byte-Order Mark (BOM): A Unicode Byte-Order Mark is a specific sequence of bytes that represents the Unicode special character U+FEFF (“zero-width no-break space”). If a BOM exists, it is located at the very beginning of an XML document. The BOM character is represented differently in UTF-8, UTF-16, or UTF-32. This allows an XML parser (DB2) to recognize the BOM and use it to infer whether the encoding of the document is UTF-8, UTF-16, or UTF-32. Appendix C, Further Reading, contains pointers to more detailed information.

• An XML declaration containing an encoding declaration: The XML declaration is an optional line at the beginning of an XML document, and can contain an optional attribute named encoding. This attribute is known as the encoding declaration. For example, the following XML declaration contains an encoding declaration:
<?xml version="1.0" encoding="UTF-8" ?>

Here the whole line is the XML declaration, and encoding="UTF-8" is the encoding declaration. DB2 uses the value of the encoding attribute in the XML declaration to determine the encoding of the XML document.

Note: If the XML document has a BOM and an encoding declaration, they must match. If the BOM indicates a different encoding than the encoding declaration, DB2 rejects the XML data with the following error:

SQL16168N XML document contains an invalid XML declaration. Reason code = "7".

To look up reason code “7,” you can issue the command “? SQL16168N” at the DB2 CLP. You will find that reason code “7” means that the specified document encoding was invalid or contradicts the automatically sensed encoding. In that case you should either remove the BOM, or the XML declaration, or both.

• An XML declaration without an encoding declaration: Since the encoding declaration is optional, an XML declaration may show the XML version number only, like this:

<?xml version="1.0"?>

If there is such an XML declaration and no BOM or encoding declaration, then DB2 inspects the XML declaration to determine the encoding in the following alternative way. If the XML declaration consists of single-byte ASCII characters, then the encoding of the document is UTF-8. If the XML declaration is in double-byte ASCII characters, the encoding is UTF-16.

If an XML document has no BOM and no XML declaration at all, then DB2 interprets the document as UTF-8.

20.1.2 Externally Encoded XML Data

Your XML data has an external encoding if you use character data types (instead of binary data types) to hold the data in your application. Beware that externally encoded XML data might also contain an internal encoding, which is the case when an XML document in a character data type contains an encoding declaration. If you try to store externally encoded XML data, DB2 for Linux, UNIX, and Windows checks whether an internal encoding exists.
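To make the check for an internal encoding more concrete, the rules from section 20.1.1 can be sketched in Java. This is only an illustration of the detection order (BOM first, then the XML declaration, then the UTF-8 default), not DB2's actual parser code. The class name XmlEncodingSniffer and its method are hypothetical, and the sketch does not verify that a BOM and an encoding declaration agree.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that guesses the internal encoding of an XML byte sequence. */
final class XmlEncodingSniffer {

    // Matches the encoding attribute inside an XML declaration.
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns true if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    static String sniff(byte[] xml) {
        // 1. Byte-order mark. UTF-32 is tested first because the UTF-32LE BOM
        //    starts with the same two bytes as the UTF-16LE BOM.
        if (startsWith(xml, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(xml, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(xml, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(xml, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(xml, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. No BOM, but "<?" stored in two bytes per character: UTF-16.
        if (startsWith(xml, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(xml, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";

        // 3. Single-byte XML declaration: use its encoding attribute if present.
        String head = new String(xml, 0, Math.min(xml.length, 256),
                                 StandardCharsets.ISO_8859_1);
        if (head.startsWith("<?xml")) {
            int end = head.indexOf("?>");
            if (end > 0) {
                Matcher m = ENCODING.matcher(head.substring(0, end));
                if (m.find()) return m.group(1);
            }
        }

        // 4. No BOM and no encoding declaration: UTF-8 is assumed.
        return "UTF-8";
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(sniff(doc));
    }
}
```

An application could run such a check on a byte array before binding it to a parameter marker, for example to log which encoding a document claims to use.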
If your XML data has an internal and an external encoding that are not Unicode, the internal encoding must match the external encoding. Otherwise DB2 for
Linux, UNIX, and Windows rejects the XML data with error SQL16103N. If the external and the internal encoding are Unicode encodings, DB2 ignores the internal encoding.

DB2 for z/OS does not enforce consistency of the internal and external encoding. If the internal and external encoding information are different, the external encoding takes precedence although character conversion might have occurred on the data and there might be data loss. Hence, it is strongly recommended to avoid a mismatch between internal and external encoding.

20.2 AVOIDING CODE PAGE CONVERSIONS

Avoiding code page conversions helps reduce CPU consumption and prevents unintentional data loss or data truncation. To avoid code page problems, it is recommended to use internally encoded XML data, not externally encoded XML. This means it is recommended to handle XML data in your application in binary data types rather than character data types.

For example, when you insert XML data into DB2 and use the function SQLBindParameter() in CLI applications to bind a parameter marker to an XML document, you should use SQL_C_BINARY data buffers rather than SQL_C_CHAR, SQL_C_DBCHAR, or SQL_C_WCHAR. When inserting XML data from Java applications, reading in the XML data as a binary stream (setBinaryStream) is preferred over character strings (setString). Similarly, if your Java application receives XML from DB2 and writes it to a file, code page conversion can occur if the XML is written as non-binary data.

20.3 USING NON-UNICODE DATABASES FOR XML

In DB2 9 for z/OS and DB2 9.5 for Linux, UNIX, and Windows you can use any database code page to manage XML data using the pureXML capabilities. It does not have to be Unicode. While Unicode is recommended, using a non-Unicode database code page can be desirable due to special application requirements.
In some cases, you might have existing databases in a non-Unicode code page and you might want to add XML data to such a database. DB2 always stores and processes XML data in Unicode (UTF-8) even if the database code page (or in DB2 for z/OS the table space encoding) is different. XML parsing, storage, serialization as well as XML query execution and comparisons are all performed in UTF-8. In contrast, SQL data is always stored and processed in the database code page. Therefore, a non-Unicode database causes code page conversion whenever a query combines SQL and XML data, or casts SQL type data to XML, or XML type data to SQL types. This code page conversion can only be avoided if the database code page is UTF-8, the same for SQL and XML data. Figure 20.1 illustrates code page considerations with XML data in DB2. If your application uses character type variables to hold XML data, the data will be converted from the application code page to the database code page upon insert into DB2 (indicated by arrow 1). Then DB2 converts
the XML data from the database code page to UTF-8 for XML parsing and storage (arrow 2). Similarly, if you use character type application variables for retrieving XML data, the XML will be converted from UTF-8 to the database code page and then to the application code page.

These code page conversions between application code page and database code page can be avoided in two ways:

• Use binary (or XML) instead of character type variables for XML data in your application. This is illustrated by arrow 3 in Figure 20.1.
• Use the same code page for your database and your application. This avoids conversion between database and application code page.

In both cases, DB2 still converts the XML data to UTF-8 if it isn’t in UTF-8 already.

[Figure 20.1 Code page conversions. Diagram labels: DB2 application (application code page; character data, binary data type, XML data type), DB2 database (database code page), pureXML storage (UTF-8), arrows 1, 2, and 3.]

If you avoid code page conversions between application and database code pages, then you avoid risking data loss. It is possible that characters in your application code page cannot be represented in the database code page, or vice versa. In this case DB2 introduces substitution characters into the data and issues an error or warning. The next section illustrates these issues in various examples.

20.4 EXAMPLES OF CODE PAGE ISSUES

Let’s look at some examples of code page conversion issues.

20.4.1 Example 1: Chinese Characters in a Non-Unicode Code Page ISO-8859-1

Assume a database in DB2 for Linux, UNIX, and Windows has been created with the non-Unicode code page ISO-8859-1:

CREATE DATABASE test USING CODESET ISO-8859-1 TERRITORY us;
An application uses a character data type (for example, SQL_C_DBCHAR in CLI, or setString in Java) to insert the document in Figure 20.2, which contains Chinese characters.

<book>
  <title>Romance of the Three Kingdoms</title>
  <nativeTitle> </nativeTitle>
  <author>
    <firstname>Lou</firstname>
    <lastname>Guanzhong</lastname>
    <nativeName> </nativeName>
  </author>
</book>

Figure 20.2 XML document with Chinese characters

This document will be converted from whatever the application code page is to the database code page, which is ISO-8859-1. The Chinese characters cannot be represented in ISO-8859-1 and will be replaced by a substitution character. For ISO-8859-1, the substitution character is the hexadecimal character 0x1A, which is usually displayed as a question mark (‘?’). Hence, the document will be stored as shown in Figure 20.3.

<book>
  <title>Romance of the Three Kingdoms</title>
  <nativeTitle>????</nativeTitle>
  <author>
    <firstname>Lou</firstname>
    <lastname>Guanzhong</lastname>
    <nativeName>???</nativeName>
  </author>
</book>

Figure 20.3 Document with substitution characters

As you see in Figure 20.3, the native title and the native author name are lost. The DB2 Command Line Processor (CLP) shows “?” instead of the substitution character 0x1A. You can avoid this data loss in two ways:

• Use binary instead of character type variables for XML in the application.
• Use UTF-8 as the database code page, or any other code page that can represent Chinese characters.

20.4.2 Example 2: Fetching Data from a Non-Unicode Code Database into a Character Type Application Variable

A database in DB2 for Linux, UNIX, and Windows has been created with the non-Unicode code page ISO-8859-1. An application has used a binary application variable to insert the document
with Chinese characters (see Figure 20.2). The Chinese characters are preserved. The table name is books and the XML column name is doc. Assume the query in Figure 20.4 is used to fetch XML data into a character type application variable:

SELECT XMLQUERY('$DOC/book/nativeTitle') FROM books

Figure 20.4 Retrieving the Element nativeTitle

Since the query result is fetched into a character type variable, it is first converted from UTF-8 (XML storage) to the database code page, which cannot represent Chinese characters. Thus, this query fails with the following error:

SQL20412N Serialization of an XML value resulted in characters that could not be represented in the target encoding.

To avoid this error, use a binary instead of a character type variable to bind out the XML value to the application, or use a database code page that can represent Chinese characters, such as UTF-8.

20.4.3 Example 3: Encoding Issues with XMLTABLE and XMLCAST

This example uses the same database scenario as in example 2. Assume the query in Figure 20.5 is submitted to the database to retrieve the native title and author information. The XMLTABLE function in this query extracts XML values and converts them to SQL VARCHAR values in the database code page (ISO-8859-1). Since Chinese characters cannot be represented in this code page, the VARCHAR values will contain the substitution character instead. This substitution also applies if XMLCAST is used to convert the data to VARCHAR. To avoid this problem, create the database in a code page that can represent the Chinese characters.

SELECT x.*
FROM books,
     XMLTABLE('$DOC/book'
       COLUMNS
         NativeTitle  VARCHAR(50) PATH 'nativeTitle',
         NativeAuthor VARCHAR(50) PATH '//nativeName') AS x

Figure 20.5 Return the native title and author as VARCHAR values
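An application can test in advance whether a string survives conversion to a given code page. The following Java sketch uses the standard CharsetEncoder.canEncode method for that purpose. The class name CodePageCheck is hypothetical, and the four-character Chinese string is a sample chosen for illustration, written with Unicode escapes. Note that this checks Java's own character set tables, which may not match DB2's conversion tables in every detail.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that tests whether text fits into a target code page. */
final class CodePageCheck {

    /** Returns true if every character of text can be encoded in the charset. */
    static boolean representable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String title = "Romance of the Three Kingdoms";
        // Sample Chinese text (an assumption for this sketch).
        String nativeTitle = "\u4E09\u56FD\u6F14\u4E49";

        System.out.println(representable(title, StandardCharsets.ISO_8859_1));       // true
        System.out.println(representable(nativeTitle, StandardCharsets.ISO_8859_1)); // false
        System.out.println(representable(nativeTitle, StandardCharsets.UTF_8));      // true
    }
}
```

If such a check returns false for the database code page, the value is a candidate for the substitution or serialization errors described in Examples 1 through 3.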
20.4.4 Example 4: Japanese Literal Values in a Non-Unicode Database

Assume a database has the non-Unicode code page ISO-8859-1 and that the query in Figure 20.6 is issued. Note that it contains the Japanese character “ ” as a literal value in the XML predicate.

SELECT * FROM items
WHERE XMLEXISTS('$DOC/item[name = " "]')

Figure 20.6 A query with a Japanese character

The query text will be converted to the database code page. As a result, the Japanese character will be replaced by the substitution character 0x1A, which is not a valid character for an XQuery expression. Hence, DB2 returns the following error, which would not occur in a Unicode database:

SQL16002N: An XQuery expression has an unexpected token "0x1A" following "=".

20.4.5 Example 5: Data Expansion and Shrinkage Due to Code Page Conversion

A database has been created with the Unicode encoding UTF-8. The database contains XML documents that include Korean characters. A Java application connects to the database to read XML data. The application code page is UTF-16. The application uses the method ResultSet.getString to bind XML data from the database to a String type variable. Since String is a character data type, code page conversion from the DB2 storage format (UTF-8) to the application code page (UTF-16) is performed. There is no data loss because UTF-16 can represent all the same characters as UTF-8 (and vice versa).

However, some Korean characters are represented by 3 bytes in UTF-8; that is, 3 × 8 bits, while the same characters may use only 16 bits in UTF-16. In other words, the same character requires more space in UTF-8 than in UTF-16. Hence, when you retrieve a Korean character string from a UTF-8 database to a String variable in a UTF-16 application, the resulting string length in the application (in bytes) might be smaller than in the database.
Conversely, if you hold a Korean character string in a character type variable in your UTF-16 application and the length is, for example, 20 bytes, inserting this string into a CHAR(20) column in a UTF-8 database may fail. The reason is that the same string might require more bytes in UTF-8 than in UTF-16. If the same character string was part of an XML document that is inserted into an XML column, the data expansion does not lead to a failure because there is no length restriction associated with an XML column (other than the 2GB maximum size per document).
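The expansion can be verified with a few lines of Java. The sketch below measures a sample string of ten Korean (Hangul) syllables, written with Unicode escapes. The class name EncodedLength and the sample string are assumptions made for this illustration. UTF_16BE is used rather than Java's "UTF-16" charset because the latter prepends a 2-byte BOM when encoding.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that compares encoded lengths in bytes. */
final class EncodedLength {

    /** Number of bytes the string occupies in UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies in UTF-16, without a BOM. */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Ten Hangul syllables; each needs 2 bytes in UTF-16 and 3 bytes in UTF-8.
        String korean = "\uB300\uD55C\uBBFC\uAD6D\uB370\uC774\uD130\uBCA0\uC774\uC2A4";

        System.out.println(utf16Bytes(korean)); // prints 20
        System.out.println(utf8Bytes(korean));  // prints 30
    }
}
```

The 20-byte UTF-16 string grows to 30 bytes in UTF-8, which is why it would no longer fit into a CHAR(20) column of a UTF-8 database.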
20.5 AVOIDING DATA LOSS AND ENCODING ERRORS IN NON-UNICODE DATABASES

The previous section has shown some of the problems that can occur when character type variables are used to insert (bind in) or fetch (bind out) XML data from a non-Unicode database. Encoding errors and data loss can happen. Again, using a Unicode database and handling XML with binary data types in your application is the best way to avoid these problems. If you have to use a non-Unicode database then you can still avoid many problems by using binary instead of character types in your application when you insert or retrieve XML data.

In DB2 9.5 and 9.7 for Linux, UNIX, and Windows you can use the database configuration parameter ENABLE_XMLCHAR to prevent applications from inserting XML data via character data types. By default, this parameter is set to ON [YES] to allow the use of character types. Use the following command to block any inserts of character type data into XML columns:

db2 UPDATE DB CFG FOR <dbname> USING enable_xmlchar off

Subsequently, XML inserts with character type variables or parameter markers are rejected with error message SQL20429N:

SQL20429N The XML operation is not allowed on strings that are not FOR BIT DATA on this database.

This error ensures that data loss due to character substitution cannot occur upon insert. Applications will need to use binary data types to avoid this error and to avoid character substitutions. When ENABLE_XMLCHAR is set to OFF, you cannot insert XML data in plain text through the DB2 CLP.

20.6 SUMMARY

The character representation of an XML document can have an internal encoding, an external encoding, or both. The internal encoding is determined by the document itself, through an encoding declaration or a Unicode byte order mark. The external encoding of a document is the same as the application code page, if the application code holds XML data in character (string) type variables.
If an application holds XML data in binary type variables then there is no external encoding, only an internal encoding. It is recommended to create DB2 databases with UTF-8 as the database code page. DB2 always stores XML data in UTF-8 encoding, even if the database code page is not UTF-8. For your application it is recommended to use binary data types for XML data to avoid external encoding. External encoding leads to additional code page conversion when your application exchanges
XML data with the DB2 server, except when the application code page and the database code page are UTF-8. Code page conversion can lead to data loss if characters in one code page cannot be represented in another code page. Understanding XML encoding concepts is important for XML application development and the passing of XML data through APIs between database server and applications.
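As a closing illustration of the recommendation to use binary data types, the following Java sketch inserts an XML document through setBinaryStream rather than setString. It is a minimal sketch under several assumptions: the table customer(cid INTEGER, info XML) is used as a sample target, the caller supplies an open JDBC Connection, and the class and method names are hypothetical. The document is serialized as UTF-8 with a matching encoding declaration, so its internal encoding is unambiguous.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper that inserts XML as internally encoded binary data. */
final class BinaryXmlInsert {

    // Sample table assumed for this sketch: customer(cid INTEGER, info XML).
    static final String INSERT_SQL = "INSERT INTO customer(cid, info) VALUES (?, ?)";

    /** Prefixes the body with a UTF-8 encoding declaration and encodes it as UTF-8. */
    static byte[] toUtf8Document(String body) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body;
        return doc.getBytes(StandardCharsets.UTF_8);
    }

    /** Binds the document as a binary stream, so no application code page is involved. */
    static int insert(Connection con, int cid, String body) throws SQLException {
        byte[] xml = toUtf8Document(body);
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            ps.setInt(1, cid);
            ps.setBinaryStream(2, new ByteArrayInputStream(xml), xml.length);
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // Shows only the serialization step; insert() needs a live connection.
        System.out.println(toUtf8Document("<customerinfo/>").length + " bytes");
    }
}
```

Because the parameter is bound as binary data, the XML is treated as internally encoded and is not converted through the application or database code page.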
CHAPTER 21
Developing XML Applications with DB2

Application development encompasses all the tasks that go beyond the mere creation and maintenance of database objects. Typical application development tasks include

• Developing application code in a programming language such as Java or COBOL and managing all interactions with the database through APIs such as JDBC.
• Designing and maintaining XML artifacts such as XML Schemas and XSLT style sheets with XML application development tools.
• Developing database stored procedures and user-defined functions (see Chapter 18).
• Writing queries in SQL, XQuery, or SQL/XML as well as writing INSERT, DELETE, and UPDATE statements. (See Chapters 6 through 9 and Chapter 12).

XML application development often deals with moving XML data between the database server and a client application. Codepage conversion issues can arise when the database and the application have different codepages. The discussion in this chapter assumes that you are familiar with key concepts from Chapter 20, Understanding XML Data Encoding, such as XML declarations, encoding declarations, internal encoding, and external encoding.

You can use a wide variety of programming languages and APIs to write DB2 pureXML applications, including the following:

• Assembler
• C or C++ (embedded SQL or DB2 CLI)
• COBOL
• Java (JDBC or SQLJ)
• C# and Visual Basic (.NET)
• Perl
• PHP
• PL/1
• Ruby, and the Ruby on Rails framework

In this chapter we discuss application programming with DB2 pureXML for a subset of these languages and APIs. In particular, this chapter covers the following topics:

• The value of DB2 pureXML for application development (section 21.1)
• Parameter markers and host variables in SQL/XML (section 21.2)
• Java applications for DB2 pureXML (section 21.3)
• .NET applications for DB2 pureXML (section 21.4)
• CLI applications for DB2 pureXML (section 21.5)
• COBOL, PL/1, and C applications for DB2 pureXML (section 21.6)
• PHP applications for DB2 pureXML (section 21.7)
• Perl applications for DB2 pureXML (section 21.8)
• XML application development tools (section 21.9)

Each section assumes that the reader is already familiar with the programming language or API that is being discussed. Emphasis is placed on the special considerations for XML manipulation and interaction with a DB2 pureXML database. A more general introduction to the listed languages and APIs is beyond the scope of this book.

The code samples in this chapter are based on the customer table:

CREATE TABLE customer(cid INTEGER, info XML)

21.1 THE VALUE OF DB2 PUREXML FOR APPLICATION DEVELOPMENT

As an application developer, you will find that the pureXML features in DB2 provide significant value for XML application development. For example, rapid prototyping, flexibility, and avoiding XML parsing in the application are common benefits.

21.1.1 Avoid XML Parsing in the Application Layer

Traditionally, applications that need to manipulate XML documents often read full documents from the file system or CLOB columns into memory. They then use an XML Document Object Model (DOM) parser to gain access into the documents. The drawback of DOM parsing is that the entire XML document is represented in memory as a tree. This tree can be five to ten times
larger than the original XML file, which can be acceptable if you process small documents, one or a few at a time. However, memory consumption poses a significant problem if you manipulate large documents or many at the same time. Additionally, the CPU consumption of XML parsing is a common performance problem. Also, DOM manipulation requires specific skill and results in additional non-trivial application code that needs to be maintained over time.

As an alternative, Simple API for XML (SAX) parsers and Streaming API for XML (StAX) parsers alleviate the memory consumption problem because they are event- and stream-based interfaces that give the application access to only a part of the XML document at a time. They are faster and consume less memory than DOM parsers because they do not hold the entire document in memory. However, the CPU overhead remains. Navigating through an XML document with a SAX or StAX parser requires extra coding because you cannot easily go backwards in a stream of events. Hence, it’s the application’s responsibility to intelligently buffer any part of the document that it might need to revisit. You should avoid this complexity as much as possible.

Another disadvantage of DOM-, SAX-, and StAX-based XML document manipulation is that these APIs allow you to process only one document at a time. Querying or updating many XML documents based on specific search criteria requires additional coding and processing overhead.

With DB2 pureXML, applications can often avoid XML parsing, because DB2 parses XML documents only once, at insert time, and stores them in a parsed hierarchical format. The parsed storage allows you to extract or update document fragments or individual values without having to parse the XML data in your application. Applications send appropriate XML query or update statements to DB2 instead of fetching and parsing full documents.
As a result there is less application code, reduced application complexity, and higher end-to-end performance. Additionally, DB2 can efficiently execute XML queries and updates over large collections of XML documents without XML parsing and without additional application code. In particular, DB2’s XML indexes can evaluate search conditions and find matching documents quickly.

Let’s consider further examples of how DB2 pureXML avoids XML processing in the application code:

• Assume your application receives an XML document and needs to insert specific values from the document into relational columns of a DB2 table. You could parse the XML document in the application, extract the values, and issue a traditional INSERT statement to DB2. However, letting DB2 do this work is often easier and more efficient. Simply issue an INSERT statement with an XMLTABLE function and provide the document as a parameter.
• Assume your application receives a very large document and you want to split it into smaller documents and insert those into DB2. Although you could split the document with an XML parser in your application, it is again easier and more efficient to let DB2 do the work in an INSERT statement with an XMLTABLE function.
• Assume your application needs to read certain values from one or several XML documents that are stored in DB2. You should use DB2’s SQL/XML and XQuery features instead of performing XML parsing of full documents in the application.
• Assume you need to generate XML documents from data in several relational tables. You could write custom application code to read the relational values and construct XML data. However, it is often faster and simpler to use declarative SQL/XML construction queries and avoid the extra coding work in your application.

In situations where applications still require access to XML data through DOM or SAX APIs, they can use the new JDBC 4.0 features, which are covered later in this chapter.

21.1.2 Storing Business Objects in an Intuitive Format

If application data represents business objects, such as insurance claim forms, then it is often beneficial to keep all data items that comprise a particular claim together instead of spreading them over a set of tables. Often the individual data items of a claim form have no valid business meaning on their own and can only be interpreted in the context of the complete form. Normalizing the claims across dozens of relational tables means that applications (and application developers) deal with a complex and unnatural fragmentation of the business data. This fragmentation often increases application complexity, development costs, and the chance for errors. It also introduces the need for multi-way join queries to reassemble the original business objects.

DB2 pureXML allows you to manage complex business objects as cohesive and distinct documents while still capturing all the relationships between the data items that comprise the business object. Representing each claim form (business object) as a single XML document in a single row of a table provides a very intuitive storage model for the application developer and data analyst.
The same applies to other business objects, such as orders, trades, tax returns, travel reservations, or medical records.

21.1.3 Rapid Prototyping

Designing a data model and a corresponding relational database schema can be a time-consuming and complicated task that is subject to a variety of design decisions. How far do you normalize the data? What should be the join keys between the tables? Where can you assume a one-to-one relationship between data items and where do you have to account for one-to-many relationships? Will each of your products belong to just one category, or to multiple? Which columns and data types will you need in a given table? Will the product identifiers always be numeric or do you need to prepare for alphanumeric IDs?

In the early stages of application prototyping there is often incomplete information to make all these decisions. Additional information and requirements tend to keep trickling in so that the initial data model and relational schema is subject to frequent early changes. These changes take
time and require SQL statements in the application prototype to be modified and tested. This overhead is undesirable when the goal of the prototyping project is not (yet) to produce an optimal data model but to showcase requested application functionality quickly.

With DB2 pureXML, many of the relational design decisions can be postponed. You can choose XML as the data format for your application prototype and store evolving XML formats in a column of type XML in a DB2 database. The usage of a fixed XML Schema is not required. You can build an application prototype quickly without having to define data types or decide on one-to-one and one-to-many relationships. XML gives you the flexibility to leave these things undefined at the database level. As a result, you can prototype more rapidly and be more resilient to changing requirements.

Note, however, that the flexibility of XML for rapid prototyping does not mean that you can or should develop production applications without carefully thinking about data design and database schema design. Let’s consider an example. If you model information about customers and their addresses and phone numbers, then eventually you should define precisely whether a customer can have one or more than one phone number. This decision affects applications that consume the data. The key benefit of XML is that when you make this decision, changing a one-to-one relationship to a one-to-many relationship is much easier in an XML Schema than in a relational schema. Using XML allows you to make this change without modifying or adding any database tables.

To be very clear, the benefit of XML is that many types of schema changes are easier and less costly than in a fully relational database design. Using XML does not imply that you can ignore design decisions indefinitely.
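The phone number example can be tried outside the database with the XPath engine that ships with the JDK. The sketch below is an illustration only: it does not involve DB2, the class name PhoneLookup is hypothetical, and the sample documents are made up. It evaluates one and the same XPath expression against a customer with a single phone element and against a customer with several.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper that applies one XPath predicate to differently shaped documents. */
final class PhoneLookup {

    // boolean(...) turns the selected node set into true (non-empty) or false (empty).
    static final String EXPR = "boolean(/customerinfo[phone = \"123-456-7890\"])";

    /** Returns true if the customer document has a phone element with the searched number. */
    static boolean matches(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(xml));
            return (Boolean) xpath.evaluate(EXPR, source, XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String onePhone = "<customerinfo><phone>123-456-7890</phone></customerinfo>";
        String manyPhones = "<customerinfo><phone>555-000-1111</phone>"
                          + "<phone>123-456-7890</phone></customerinfo>";

        System.out.println(matches(onePhone));   // true
        System.out.println(matches(manyPhones)); // true
    }
}
```

The comparison in the predicate is satisfied if any phone element equals the literal, so the expression needs no change when a customer is allowed to have more than one phone number.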
21.1.4 Responding Quickly to Changing Business Needs

The same flexibility that enables more rapid prototyping also enables you to react faster to change requests. Data fields can be added or removed, data types can be changed, one-to-one relationships can evolve to one-to-many, all without any modifications to the underlying database schema. XML queries are very resilient to such changes. For example, an XPath expression such as /customerinfo[phone = "123-456-7890"] is independent of whether there is a one-to-one or a one-to-many relationship between customers and phone numbers. In general DB2 pureXML reduces the overhead incurred by schema and application changes as compared to a fully relational database schema.

21.2 USING PARAMETER MARKERS OR HOST VARIABLES

Very short database queries as well as INSERT, UPDATE, and DELETE statements can execute so fast that the time to compile and optimize them is a substantial portion of their total response time. Thus, it is useful to write statements with parameter markers or host variables instead of
literal predicate values. Parameter markers and host variables are placeholders for literal values and can be replaced by actual values without having to recompile the statement. This mechanism allows you to compile (“prepare”) a statement only once and pass different literal predicate values for each execution.

Host variables are regular programming language variables that are referenced within SQL statements. Host variables are used in embedded SQL applications written for example in C, COBOL, PL/1, or Assembler. Parameter markers are not variables, but there are specific API functions to bind values of programming language variables to parameter markers.

You cannot use SQL-style parameter markers or traditional host variables in XQuery. However, the SQL/XML functions XMLQUERY, XMLTABLE, and XMLEXISTS allow you to bind SQL parameter markers or host variables to XQuery variables in an XQuery expression. This is recommended for applications with short and repetitive queries.

Figure 21.1 shows an XQuery and an SQL/XML query with hard-coded literal values in their predicates. In contrast, the queries in Figure 21.2 use a parameter marker (?) and a host variable (:hostvar), respectively, to avoid compilation of the query for each execution with a different search value. You should cast the parameter marker or host variable to an appropriate data type.

xquery
for $t in db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo
where $t/addr/zip = 12345
return $t

SELECT info
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/addr[zip=12345]')

Figure 21.1 Two XML queries with hard-coded literal values in the predicate

SELECT info
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/addr[zip=$x]'
                PASSING CAST(? AS integer) AS "x")

SELECT info
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/addr[zip=$x]'
                PASSING CAST(:hostvar AS integer) AS "x")

Figure 21.2 SQL/XML queries with parameter marker and host variable
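In a JDBC application, the parameter marker query of Figure 21.2 could be used roughly as follows. This is a minimal sketch, assuming that the caller provides an open Connection to a database containing the customer table. The class name ZipQuery and the method findByZip are hypothetical. The statement is prepared once and then executed for each zip code with a different value bound to the marker.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reuses one prepared SQL/XML statement for several values. */
final class ZipQuery {

    // The parameter marker (?) is cast and passed to the XQuery variable $x.
    static final String SQL =
        "SELECT info FROM customer "
      + "WHERE XMLEXISTS('$INFO/customerinfo/addr[zip=$x]' "
      + "PASSING CAST(? AS INTEGER) AS \"x\")";

    /** Returns the matching customer documents as byte arrays. */
    static List<byte[]> findByZip(Connection con, int... zips) throws SQLException {
        List<byte[]> docs = new ArrayList<>();
        // The statement is compiled ("prepared") only once.
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            for (int zip : zips) {
                ps.setInt(1, zip); // bind a new value without recompiling
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        docs.add(rs.getBytes(1)); // binary retrieval of the XML value
                    }
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // Prints only the statement text; findByZip() needs a live connection.
        System.out.println(SQL);
    }
}
```

A call such as findByZip(con, 12345, 95141) would then execute the same prepared statement twice, once per search value.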
21.3 JAVA APPLICATIONS

The Java programming language and its database interface JDBC are very popular choices for XML application development. IBM provides a single driver that supports both the JDBC and the SQLJ interfaces of the Java language. This driver is called IBM Data Server Driver for JDBC and SQLJ, also known as JCC (Java Common Client). It is a JDBC type 2 and type 4 driver and can connect to DB2 on all platforms. The type 2 driver is deprecated and you should use the type 4 driver.

An installation of DB2 9.1 for Linux, UNIX, and Windows includes JCC 3, which supports JDBC 3.0. DB2 9.5 for Linux, UNIX, and Windows includes JCC 4, which supports JDBC 3.0 and a subset of JDBC 4.0. Note that JCC 4 and JDBC 4.0 require Java 6.0. The JAR files db2jcc.jar and db2jcc4.jar are also included in the latest DB2 Client or can be downloaded at http://www.ibm.com/software/data/db2/java/. Table 21.1 provides a summary of the JCC drivers.

Table 21.1 DB2’s Support for JDBC 3.0 and 4.0

DB2 Version         JCC Driver  JDBC Support      JAR File     Minimum Java Level Required
DB2 9.1             JCC 3       JDBC 3.0          db2jcc.jar   1.4
DB2 9.5 and higher  JCC 4       JDBC 3.0 and 4.0  db2jcc4.jar  6.0

IBM’s JCC 3 driver provides the proprietary XML data type DB2Xml because the JDBC 3.0 standard does not define an XML data type. The JDBC 4.0 standard introduces an XML data type called SQLXML, which is supported by the JCC 4 driver (see Table 21.2).

Table 21.2 XML Data Type Support in JCC 3 and JCC 4

JCC Driver  JDBC      Java Interface for XML Data  Java Constant for the XML Data Type
JCC 3       JDBC 3.0  com.ibm.db2.jcc.DB2Xml       java.sql.Types.OTHER
JCC 4       JDBC 4.0  java.sql.SQLXML              java.sql.Types.SQLXML

21.3.1 XML Support in JDBC 3.0

To retrieve XML data from a DB2 database into your JDBC 3.0 application, use the Java standard interface ResultSet as you normally would for relational data.
The interface ResultSet offers various getter methods to retrieve XML data from the current result row into an application variable. Table 21.3 lists those methods together with the data type and encoding of their output. Remember that UCS-2 is a subset of UTF-16. The methods in Table 21.3 do not add an encoding declaration to the retrieved XML data and are also available in JCC 4.
Table 21.3 JDBC 3.0 and the DB2Xml Data Type

    Getter Methods on ResultSet   Application Data Type   Encoding (without Declaration)
    getAsciiStream()              InputStream             ASCII
    getBytes()                    byte[]                  UTF-8
    getBinaryStream()             InputStream             UTF-8
    getString()                   String                  UCS-2
    getCharacterStream()          Reader                  UCS-2
    getObject()                   DB2Xml                  None (DB2Xml object)

The method getObject() retrieves XML data into an object of type DB2Xml. The benefit of the DB2Xml object, as compared to a generic ResultSet object, is that it offers a wider range of getter methods (see Table 21.4). In particular, the DB2Xml interface includes methods that generate XML declarations with an encoding attribute for the retrieved XML data, as well as methods that force the XML data to be converted to a specified target encoding. For example, the methods getDB2String() and getDB2XmlString() return the XML data in the same encoding, UCS-2, but the latter adds the appropriate encoding declaration to the XML document. While getDB2BinaryStream() always returns XML data in UTF-8 format without an encoding declaration, the method getDB2XmlBinaryStream() takes a string argument that specifies which encoding to produce.

Table 21.4 DB2Xml Getter Methods, Data Types, and Encoding Specifications

    JDBC Interface   Getter Method                     Output Data Type   XML Encoding Declaration Added
    DB2Xml           getDB2AsciiStream()               InputStream        None
                     getDB2BinaryStream()              InputStream        None
                     getDB2Bytes()                     byte[]             None
                     getDB2CharacterStream()           Reader             None
                     getDB2String()                    String             None
                     getDB2XmlAsciiStream()            InputStream        ASCII
                     getDB2XmlBinaryStream(Encoding)   InputStream        Specified by the Encoding parameter
                     getDB2XmlBytes(Encoding)          byte[]             Specified by the Encoding parameter
                     getDB2XmlCharacterStream()        Reader             ISO-10646-UCS-2
                     getDB2XmlString()                 String             ISO-10646-UCS-2

Figure 21.3 shows usage examples for some of the getter methods in Table 21.3 and Table 21.4. The last three Java calls in Figure 21.3 call getter methods on the DB2Xml object that was previously retrieved from the ResultSet.

    import java.io.InputStream;
    import java.sql.ResultSet;
    import com.ibm.db2.jcc.DB2Xml;

    ResultSet rs = statement.executeQuery(
        "SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer "
      + "WHERE XMLEXISTS('$INFO/customerinfo[addr/zip = 95123]')");
    rs.next();

    //***** getter methods on ResultSet: *****

    //retrieve XML into a UTF-8 byte array:
    byte[] xmlBytes = rs.getBytes(1);   // 1 is the column index

    //retrieve XML as a UCS-2 string variable:
    String xmlString = rs.getString(1);

    //retrieve XML as a DB2Xml object:
    DB2Xml xmlObj = (DB2Xml) rs.getObject(1);

    //***** getter methods on DB2Xml: *****

    //retrieve XML from the DB2Xml object as a string
    //with encoding declaration for UCS-2:
    String xmlStringWithDecl = xmlObj.getDB2XmlString();

    //retrieve XML from the DB2Xml object as a UTF-8 binary stream:
    InputStream utf8Stream = xmlObj.getDB2BinaryStream();

    //retrieve XML from the DB2Xml object as a binary stream,
    //converted to the target encoding EUC-JP:
    InputStream eucJpStream = xmlObj.getDB2XmlBinaryStream("EUC-JP");

Figure 21.3 JDBC methods to retrieve XML data from a query result set
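A byte array such as the one returned by getBytes() in Figure 21.3 can be handed to an XML parser directly, without first converting it to a String. The following sketch illustrates the idea under stated assumptions: the class name BinaryXmlParse is made up, the byte array is built in memory as a stand-in for a value fetched from DB2, and StandardCharsets requires Java 7 or later. Because the bytes carry no encoding declaration and no byte-order mark, a conforming XML parser treats them as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; the name is illustrative only.
class BinaryXmlParse {

    // Parses UTF-8 encoded XML bytes into a DOM tree.
    // No intermediate String (UTF-16) copy of the document is created.
    static Document parse(byte[] utf8Xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(utf8Xml));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for: byte[] xmlBytes = rs.getBytes(1);
        byte[] xmlBytes =
            "<name>Jos\u00e9 Ruiz</name>".getBytes(StandardCharsets.UTF_8);

        Document doc = parse(xmlBytes);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```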
If your application does not need to manipulate the XML data in character string format, you should retrieve XML as binary data with methods such as getBytes(), getBinaryStream(), getDB2Bytes(), or getDB2XmlBinaryStream("UTF-8"). Using these methods avoids unnecessary conversion of the XML data from UTF-8 to UTF-16.

An object of type DB2Xml cannot be updated or used to update an XML value in the database. If you want to update or insert data into an XML column, use one of the setter methods of the interface PreparedStatement. Table 21.5 lists these methods and their input data types. The method setSQLXML is not available on PreparedStatement until JDBC 4.0 but is listed here for completeness.

Table 21.5 Methods to Insert or Update XML Data

    JDBC Interface      Setter Method          Input Data Type
    PreparedStatement   setAsciiStream()       InputStream
                        setBinaryStream()      InputStream
                        setBlob()              Blob
                        setBytes()             byte[]
                        setCharacterStream()   Reader
                        setClob()              Clob
                        setString()            String
                        setObject()            byte[], Blob, Clob, DB2Xml, InputStream, Reader, String
                        setSQLXML()            SQLXML (new in JDBC 4.0)

The code sample in Figure 21.4 shows how to insert XML data from a file into the info column of the customer table using the setter method setBinaryStream().

    String sql = "INSERT INTO customer(cid, info) VALUES (?,?)";
    PreparedStatement stmt = connection.prepareStatement(sql);

    // open the XML file as a binary stream
    File binFile = new File("customer1013.xml");
    InputStream stream = new FileInputStream(binFile);

    // bind the customer ID and the XML document, then insert
    stmt.setInt(1, 1013);
    stmt.setBinaryStream(2, stream, (int) binFile.length());
    stmt.execute();

Figure 21.4 Inserting an XML document from a Java application

It is recommended that you send XML data to the database server as binary data, using the methods setBinaryStream(), setBlob(), or setBytes(). Binary data is treated as internally
encoded data and not converted to the codepage of the database. If you send XML data to the database server as character data using methods such as setCharacterStream(), setClob(), or setString(), then the data is externally encoded. Externally encoded data can have an internal encoding. This means that the XML data might be sent to the database server as character data but contains an encoding declaration or a Unicode Byte-Order Mark (BOM). If the external and internal encodings are incompatible, DB2 for Linux, UNIX, and Windows raises an error. DB2 for z/OS is more lenient and ignores the internal encoding if it is incompatible with the external encoding.

Both JCC 3 and JCC 4 also include methods to register and remove XML Schemas in the schema repository of a DB2 database. These methods are discussed in section 16.4.3, Registering XML Schemas from Java Applications via JDBC.

21.3.2 XML Support in JDBC 4.0

One of the main new features introduced in JDBC 4.0 is the addition of an XML data type called SQLXML to match the XML type defined by the SQL standard. The new interface to represent XML data is java.sql.SQLXML. In JDBC 4.0 the column type of an XML column is reported as java.sql.Types.SQLXML. Other interfaces such as ResultSet and PreparedStatement are enhanced with new getter and setter methods. Remember that you need the IBM JCC 4 driver and Java 6.0 to use JDBC 4.0 functions.

To obtain an object of type SQLXML, call the new method ResultSet.getSQLXML(column) and specify the XML column name or index as a parameter. Table 21.6 shows all getter methods available on the interface SQLXML and whether serialization of the XML data takes place. In contrast to the DB2Xml retrieval methods, none of the SQLXML getter methods add an encoding declaration to the retrieved XML data.
One of the key methods of the new interface is getSource(), which allows you to directly access the XML data via DOM, SAX, or StAX parser interfaces, or any other class that implements javax.xml.transform.Source.

Table 21.6 Methods to Retrieve XML Data from an SQLXML Object

    JDBC Interface   Getter Method             Data Type                         Encoding   Serialization
    SQLXML           getBinaryStream()         InputStream                       UTF-8      Yes
                     getCharacterStream()      Reader                            UCS-2      Yes
                     getString()               String                            UCS-2      Yes
                     getSource(Source.class)   DOMSource, SAXSource, StAXSource  none       No
An example of the getSource() method is shown in Figure 21.5. The sample code retrieves the XML column info for the customer with a cid value of 1000. On the ResultSet object, it fetches the first result row with resultSet.next(). Then it retrieves the XML document as a SAXSource and creates a SAX parser for it. The same is possible for DOM and StAX parsers and is demonstrated in section 21.3.3.

    ResultSet resultSet = statement.executeQuery(
        "SELECT info FROM customer WHERE cid=1000");
    resultSet.next();

    // retrieve an SQLXML object from the ResultSet
    SQLXML sqlxml = resultSet.getSQLXML(1);   // 1 is column index

    // create a SAX parser from the SQLXML object
    SAXSource source = sqlxml.getSource(SAXSource.class);
    XMLReader reader = source.getXMLReader();

    // configure parser and start parsing
    ContentHandler myHandler = ...;
    reader.setContentHandler(myHandler);
    reader.parse(source.getInputSource());

Figure 21.5 Retrieving XML Data into a SAX parser in JDBC 4.0

Just like JDBC 3.0, XML inserts and updates in JDBC 4.0 are performed with setter methods on the PreparedStatement interface, as listed in Table 21.5. JDBC 4.0 adds one new setter method called setSQLXML to PreparedStatement. This method allows you to bind an object of type SQLXML to a parameter marker for update or insert into an XML column.

The SQLXML object itself can be set with the setter methods listed in Table 21.7. In particular, you can use the method setResult to assign a DOM, SAX, or StAX representation of an XML document to the SQLXML object. Thus, if your application manipulates XML documents in one of these common formats, it does not need to serialize the XML data to its textual representation before using the XML data in an INSERT or UPDATE statement.
Table 21.7 Methods to Insert and Update XML Data from an SQLXML Object

    JDBC Interface   Setter Method             Input Data Type                   Encoding   Serialized Input Data
    SQLXML           setBinaryStream()         OutputStream                      Internal   Yes
                     setCharacterStream()      Writer                            External   Yes
                     setString(String)         String                            External   Yes
                     setResult(Result.class)   DOMResult, SAXResult, StAXResult  None       No
The code sample in Figure 21.6 prepares an INSERT statement for the customer table and creates an SQLXML object that will be inserted. The code shows how two of the four methods in Table 21.7 can be used to set the SQLXML object. If you call setCharacterStream() on the SQLXML object you obtain a Writer that you can work with to assign or assemble the new document as a character string. Alternatively, if you call setResult(DOMResult.class) you obtain a DOMResult, which allows you to assign or construct a DOM tree to define the document that is inserted into the DB2 table. The two calls are alternatives: an SQLXML object can be written only once, so use one or the other for a given object. No matter which way you set the SQLXML object, call setSQLXML on the PreparedStatement to assign the SQLXML object to the parameter marker for the XML column. The next section provides another coding example with JDBC 4.0. Further details can also be found at http://java.sun.com/javase/6/docs/api/java/sql/SQLXML.html.

    String sql = "INSERT INTO customer(cid, info) VALUES (?,?)";
    PreparedStatement stmt = connection.prepareStatement(sql);
    SQLXML sqlxml = connection.createSQLXML();

    //Alternative 1: create a writer to write into the SQLXML object
    Writer xmlWriter = sqlxml.setCharacterStream();
    xmlWriter.write(xmldocumentString);
    xmlWriter.close();

    //Alternative 2 (instead of the writer above):
    //create a DOM as input for the SQLXML object
    // DOMResult domResult = sqlxml.setResult(DOMResult.class);
    // domResult.setNode(xmldocumentDOM);

    //Bind the SQLXML object to the prepared statement and execute
    stmt.setInt(1, 1097);
    stmt.setSQLXML(2, sqlxml);
    stmt.execute();

Figure 21.6 Inserting XML data with JDBC 4.0

21.3.3 Comprehensive Example of Manipulating XML Data with JDBC 4.0

Although DB2 pureXML enables you to avoid a lot of XML parsing in the application layer, access to XML documents through the DOM, SAX, or StAX APIs can still be useful, depending on the design and requirements of your application. SAX, StAX, and DOM are complementary APIs for XML processing.
DOM is a tree-based interface that holds the complete XML document in memory and allows easy navigation and manipulation of the XML nodes. SAX and StAX represent an XML document as a stream of events that the application consumes. They are stream-based and consume less memory than DOM parsers because they do not hold the entire document in memory. StAX differs from SAX in the way the application accesses the XML data. StAX is a “pull” API because the application asks the parser for the next piece of information from the parsed XML document. SAX is a “push” API because the application receives events through callbacks as data is encountered within the source document. StAX was added in JDK 6; in JDK 5 it is available as a separate JAR.
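The difference between the push and pull styles can be seen by collecting the element names of one small document with both APIs. The sketch below is an illustration only: the class name PushVersusPull and the sample document are made up for this example, and the parsers are the default JAXP and StAX implementations that ship with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and sample data are illustrative only.
class PushVersusPull {

    static final String DOC =
          "<customerinfo Cid=\"1000\"><name>Kathy Smith</name>"
        + "<addr country=\"Canada\"><city>Toronto</city></addr>"
        + "</customerinfo>";

    // SAX (push): the parser drives the processing and calls back
    // into the handler for every start tag it encounters.
    static List<String> elementNamesWithSax(String xml) throws Exception {
        final List<String> names = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    // StAX (pull): the application drives the processing and asks
    // the reader for the next event when it is ready for it.
    static List<String> elementNamesWithStax(String xml) throws Exception {
        List<String> names = new ArrayList<String>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
        reader.close();
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNamesWithSax(DOC));
        System.out.println(elementNamesWithStax(DOC));
    }
}
```

Both methods visit the same elements in document order. With StAX the loop could stop early, for example after the first addr element, simply by not calling next() again.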
The following sample code demonstrates how to use JDBC 4.0 to exchange XML data with a DB2 database. It also illustrates the use of an SQLXML object with SAX, StAX, and DOM parsers. Comments are embedded throughout the code to explain how it works.

    package test;

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.SQLXML;
    import java.sql.Statement;

    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.stream.XMLStreamWriter;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stax.StAXResult;
    import javax.xml.transform.stax.StAXSource;

    import org.w3c.dom.Document;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;

    /* This class demonstrates some of the new SQLXML interfaces in
     * JDBC 4.0. The code shows how to use SQLXML to
     * read/write XML directly from/to SAX, StAX and DOM parsers.
     */
    public class JDBC4FeatureTest {

        // query to be executed by all methods
        private String queryString = "SELECT info FROM customer "
            + "WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1000]')";

        // connection to database
        private Connection con;

        public JDBC4FeatureTest() {
            try {
                // obtain database connection
                Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance();
                con = DriverManager.getConnection("jdbc:db2:SAMPLE");
            } catch (Exception e) {
                // driver loading and connecting throw checked exceptions
                throw new RuntimeException("Cannot connect to database", e);
            }
        }
        public static void main(String[] args) {
            JDBC4FeatureTest test = new JDBC4FeatureTest();
            test.DbToSaxParser();
            test.DbToDomTree();
            test.DbToStaxParser();
            test.SaxParserToDb();
            test.DomTreeToDb();
            test.StaxParserToDb();
        }

        /**
         * This method executes a query against a database to obtain
         * an XML document. The SQLXML type is used to pass the XML
         * document to a SAX parser. In this example, the parser
         * simply writes the document content to a stream.
         */
        private void DbToSaxParser() {
            try {
                // create and execute statement
                Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery(queryString);
                if (rs.next()) {
                    // if statement execution returned a document
                    // load query result into SQLXML object
                    SQLXML sqlxml = rs.getSQLXML(1);
                    // create SAX parser from SQLXML object
                    SAXSource source = sqlxml.getSource(SAXSource.class);
                    XMLReader parser = source.getXMLReader();
                    // configure SAX parser
                    DefaultHandler eventHandler = new SimpleSaxOutput(
                        new OutputStreamWriter(System.out));
                    parser.setContentHandler(eventHandler);
                    // parse document obtained from database
                    parser.parse(source.getInputSource());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * This method executes a query against a database to obtain
         * an XML document. The SQLXML type is used to pass the XML
         * document to a DOM parser. In this example, the document
         * content is simply written from the DOM tree to System.out.
         */
        private void DbToDomTree() {
            try {
                // create and execute statement
                Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery(queryString);
                if (rs.next()) {
                    // if statement execution returned a document
                    // load query result into SQLXML object
                    SQLXML sqlxml = rs.getSQLXML(1);
                    // obtain DOM tree from SQLXML object
                    DOMSource source = sqlxml.getSource(DOMSource.class);
                    // create document object from DOMSource
                    Document document = (Document) source.getNode();
                    // process DOM tree
                    SimpleDomOutput.writeDomTree(document,
                        new OutputStreamWriter(System.out));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * This method executes a query against a database to obtain
         * an XML document. The SQLXML type is used to pass the XML
         * document to a StAX parser. In this example, the parser
         * iterates over all elements and writes all address
         * information to the standard output stream.
         */
        private void DbToStaxParser() {
            try {
                // create and execute statement
                Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery(queryString);
                if (rs.next()) {
                    // if statement execution returned a document
                    // load query result into SQLXML object
                    SQLXML sqlxml = rs.getSQLXML(1);
                    // get XMLStreamReader from SQLXML object
                    StAXSource source = sqlxml.getSource(StAXSource.class);
                    XMLStreamReader parser = source.getXMLStreamReader();
                    // output: iterate over all elements and skip elements
                    // that are not descendants of the "addr" element
                    boolean addr = false;
                    for (int event = parser.next();
                         event != XMLStreamConstants.END_DOCUMENT;
                         event = parser.next()) {
                        switch (event) {
                            case XMLStreamConstants.START_ELEMENT:
                                if (parser.getLocalName().equals("addr")) {
                                    addr = true;
                                    System.out.println("new address:");
                                }
                                break;
                            case XMLStreamConstants.END_ELEMENT:
                                if (parser.getLocalName().equals("addr")) {
                                    addr = false;
                                }
                                break;
                            case XMLStreamConstants.CHARACTERS:
                                if (addr) {
                                    System.out.println(parser.getText());
                                }
                                break;
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * This method uses the contentHandler of the SQLXML object to
         * build a document that is then stored in the database.
         */
        private void SaxParserToDb() {
            try {
                // prepare statement to insert an xml document
                PreparedStatement prepStmt = con
                    .prepareStatement("INSERT INTO customer(cid, info) "
                        + "VALUES(1,?)");
                // create an SQLXML SAXResult object as a bridge between
                // SAX parser and the DB2 database
                SQLXML sqlxml = con.createSQLXML();
                SAXResult saxResult = sqlxml.setResult(SAXResult.class);
                // get the content handler that builds the document
                ContentHandler contentHandler = saxResult.getHandler();
                // create a SAX parser to parse the document
                XMLReader parser = XMLReaderFactory.createXMLReader();
                // parse a document (in this case from a simple string)
                // to trigger all necessary events in the contentHandler
                parser.setContentHandler(contentHandler);
                parser.parse(new InputSource(new StringReader(
                    "<customerinfo Cid=\"1\">"
                    + "<name>Linda Meyers</name>"
                    + "<addr country=\"USA\">"
                    + "<street>555 Bailey Ave</street>"
                    + "<city>San Jose</city>"
                    + "</addr>"
                    + "<phone type=\"cell\">123-654-7896"
                    + "</phone>"
                    + "</customerinfo>")));
                // execute the prepared statement to insert the
                // new xml document
                prepStmt.setSQLXML(1, sqlxml);
                prepStmt.execute();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * This method creates an XML document as a DOM tree.
         * The DOM tree is then attached to an SQLXML object
         * and inserted into the database.
         */
        public void DomTreeToDb() {
            try {
                // prepare statement to insert an xml document
                PreparedStatement prepStmt = con
                    .prepareStatement("INSERT INTO customer(cid, info) "
                        + "VALUES(2,?)");
                // create SQLXML object as a bridge between DOM tree and DB2
                SQLXML sqlxml = con.createSQLXML();
                DOMResult domResult = sqlxml.setResult(DOMResult.class);
                // create document DOM tree (done in another class)
                Document document = CreateCustomer.createCustomerDom();
                // attach the DOM tree to the SQLXML object
                domResult.setNode(document);
                // execute the prepared statement to insert the document
                prepStmt.setSQLXML(1, sqlxml);
                prepStmt.execute();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * This method creates an XML document using an XMLStreamWriter.
         * The document is then attached to an SQLXML object
         * and inserted into the database.
         */
        public void StaxParserToDb() {
            try {
                // prepare statement to insert an xml document
                PreparedStatement prepStmt = con.prepareStatement(
                    "INSERT INTO customer(cid, info) VALUES(3,?)");
                // create SQLXML object as a bridge between StAX and DB
                SQLXML sqlxml = con.createSQLXML();
                StAXResult staxResult = sqlxml.setResult(StAXResult.class);
                // obtain the stream writer from StAXResult
                XMLStreamWriter streamWriter = staxResult.getXMLStreamWriter();
                // create document and write into the stream
                // (done in another class)
                CreateCustomer.createCustomerStax(streamWriter);
                // execute the prepared insert statement
                prepStmt.setSQLXML(1, sqlxml);
                prepStmt.execute();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Figure 21.7 A comprehensive example of XML manipulation with JDBC 4.0

21.3.4 Creating XML Documents from Application Data

If your application holds information in application variables and you want to combine this data into an XML document, you can write code to do so. Creating documents is not difficult and can be done with different XML APIs, such as DOM, SAX, or StAX. The sample code in Figure 21.8 shows how to use the DOM API to construct the following simple XML document:

    <customerinfo Cid="1047">
      <name>John Doe</name>
      <phone type="home">123-456-7890</phone>
    </customerinfo>

In this code sample, the element and attribute values are obtained from hard-coded String variables. They could also come from a file, a web service, user input from a website, or other sources.

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Text;
    …
    String customerID = "1047";
    String customerName = "John Doe";
    String phoneNumberType = "home";
    String phoneNumber = "123-456-7890";

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.newDocument();

    // create the elements and their text content
    Element root = document.createElement("customerinfo");
    Element name = document.createElement("name");
    Element phone = document.createElement("phone");
    Text nameValue = document.createTextNode(customerName);
    Text phoneValue = document.createTextNode(phoneNumber);

    // assemble the tree from the leaves up to the root
    name.appendChild(nameValue);
    phone.setAttribute("type", phoneNumberType);
    phone.appendChild(phoneValue);
    root.setAttribute("Cid", customerID);
    root.appendChild(name);
    root.appendChild(phone);
    document.appendChild(root);

Figure 21.8 Constructing XML as a DOM tree in Java

The sample code in Figure 21.9 creates the same XML document as in the previous example by using the StAX API and calling a sequence of write commands on an XMLStreamWriter.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;
    …
    OutputStream out = new FileOutputStream("customer1001.xml");
    XMLOutputFactory factory = XMLOutputFactory.newInstance();
    XMLStreamWriter streamWriter = factory.createXMLStreamWriter(out);

    streamWriter.writeStartDocument();
    streamWriter.writeStartElement("customerinfo");
    streamWriter.writeAttribute("Cid", "1047");
    streamWriter.writeStartElement("name");
    streamWriter.writeCharacters("John Doe");
    streamWriter.writeEndElement();
    streamWriter.writeStartElement("phone");
    streamWriter.writeAttribute("type", "home");
    streamWriter.writeCharacters("123-456-7890");
    streamWriter.writeEndElement();
    streamWriter.writeEndElement();
    streamWriter.writeEndDocument();
    streamWriter.flush();
    streamWriter.close();

Figure 21.9 Constructing an XML Document with StAX in Java
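To check what a DOM tree such as the one in Figure 21.8 looks like as text, you can serialize it with the identity transformer from the standard javax.xml.transform package. The sketch below is one possible way to do this and is not part of the original figures: the class name DomToText and its two methods are made up for this example, and the XML declaration is suppressed only to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; the names are illustrative only.
class DomToText {

    // Builds the customerinfo document from application variables,
    // following the same steps as Figure 21.8.
    static Document buildCustomer(String cid, String name,
                                  String phoneType, String phoneNumber)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("customerinfo");
        root.setAttribute("Cid", cid);
        Element nameElem = doc.createElement("name");
        nameElem.appendChild(doc.createTextNode(name));
        Element phone = doc.createElement("phone");
        phone.setAttribute("type", phoneType);
        phone.appendChild(doc.createTextNode(phoneNumber));
        root.appendChild(nameElem);
        root.appendChild(phone);
        doc.appendChild(root);
        return doc;
    }

    // Serializes a DOM tree to its textual XML representation.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildCustomer("1047", "John Doe", "home", "123-456-7890");
        System.out.println(serialize(doc));
    }
}
```

Serialization like this is useful for logging or debugging. For inserting into DB2 with JDBC 4.0, the DOM tree can instead be attached to an SQLXML object through setResult(DOMResult.class), as shown in Figure 21.7, which avoids the textual step.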
21.3.5 Binding XML Data to Java Objects

In the previous sections we explained how to access an XML document from a Java application using the XML parser interfaces DOM, SAX, and StAX. Each of these APIs requires manual coding in order to process the XML elements and attributes and assign them to application variables, or vice versa. Alternatively, you can use frameworks that automatically perform a mapping between XML documents and Java objects. This mapping is known as XML data binding. It enables your application to abstract from the actual tree structure of XML documents and instead work directly with the data content of those documents in Java objects. Popular XML mapping frameworks for Java include JAXB, JiBX, Castor, XMLBeans, and XStream. Their detailed discussion is beyond the scope of this book, but Appendix C, Further Reading, contains pointers to further information.

In general, the process of binding XML data to Java objects consists of two phases. First you provide an XML Schema or DTD to the mapping framework. Based on the XML structure in the schema and predefined mapping rules, the framework then generates a set of Java class definitions. In the second phase you can convert an XML document to instances of these Java classes, and vice versa. The process of serializing a Java object into an XML document is called marshalling. The reverse process of building a Java object from an XML document is called unmarshalling.

Some of the available mapping frameworks, such as XStream, do not require the initial setup phase with an XML Schema. XStream is driven exclusively by XML instance documents and converts any given XML document into an appropriate Java object, based on predefined mapping rules. XML data-binding frameworks can be useful when retrieving XML documents through object relational mapping frameworks, such as IBM pureQuery.
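To make the terms marshalling and unmarshalling concrete, the following sketch writes both directions by hand for a tiny class. It is not how JAXB or any other framework is used. It is a hand-written stand-in for the kind of mapping code such frameworks generate or perform automatically, and the class CustomerInfo with its two fields is an assumption made for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class standing in for a class that a data-binding
// framework would generate from an XML Schema.
class CustomerInfo {
    String cid;
    String name;

    CustomerInfo(String cid, String name) {
        this.cid = cid;
        this.name = name;
    }

    // Marshalling: serialize this Java object into an XML document.
    String marshal() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("customerinfo");
        w.writeAttribute("Cid", cid);
        w.writeStartElement("name");
        w.writeCharacters(name);   // escapes characters such as & and <
        w.writeEndElement();
        w.writeEndElement();
        w.flush();
        w.close();
        return out.toString();
    }

    // Unmarshalling: build a Java object from an XML document.
    static CustomerInfo unmarshal(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        String cid = root.getAttribute("Cid");
        String name =
            root.getElementsByTagName("name").item(0).getTextContent();
        return new CustomerInfo(cid, name);
    }

    public static void main(String[] args) throws Exception {
        String xml = new CustomerInfo("1047", "John Doe").marshal();
        CustomerInfo copy = unmarshal(xml);
        System.out.println(xml);
        System.out.println(copy.cid + " / " + copy.name);
    }
}
```

With a data-binding framework, the application would call the framework's marshal and unmarshal operations and work only with the fields of the Java object, while the element and attribute handling shown here stays hidden.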
21.3.6 IBM pureQuery

IBM pureQuery is a set of database tools designed to simplify the development of database applications in Java. pureQuery includes an object relational mapping (ORM) framework that aims to relieve Java developers from some of the tediousness associated with JDBC programming. For example, given a database table, pureQuery generates Java classes that represent the table data and contain SQL statements to convert relational data to Java variables. Generating Java classes from database tables is a bottom-up approach. The generated mapping code contains methods and SQL statements for basic create, read, update, and delete operations (CRUD). For the customer table in the sample database, pureQuery generates two Java classes:

• The Java Bean Customer.java, which contains a field for each column in the customer table
• The mapping class CustomerData.java, which contains the SQL statements in the form of Java annotations
Figure 21.10 shows excerpts from both classes.

    public interface CustomerData {
        // Select all CUSTOMERs
        @Select(sql = "SELECT CID, INFO" + " FROM CUSTOMER")
        Iterator<Customer> getCustomers();

        // Select CUSTOMER by parameters
        @Select(sql = "SELECT CID, INFO" + " FROM CUSTOMER" + " WHERE CID = ?")
        Customer getCustomer(long cid);
        ...

    public class Customer {
        // Class variables
        protected int cid;
        protected String info;
        ...

Figure 21.10 Java classes generated by pureQuery

Based on these classes, a Java application can make the following calls to retrieve information about specific customers, such as the customer with the relational cid value 1004:

    Customer customer = CustomerData.getCustomer(1004);

This code allows you to access table columns in a purely object-oriented fashion. You can retrieve or update fields of the Java Bean and do not need to write JDBC code or SQL statements.

pureQuery also supports the reverse approach; that is, generating table definitions and the corresponding mapping classes based on a group of Java objects (top-down approach). In both the top-down and bottom-up approach the generated mapping classes contain standard SQL statements that an application programmer or DBA can see and modify if required. Modifying the statements can sometimes be useful for SQL optimization and customization. In contrast, many other ORM frameworks use proprietary query languages that hide the actual SQL statements from the application developer because SQL gets generated at runtime only.

As shown in Figure 21.10, pureQuery exposes an XML column as a Java String variable. It does not support XML data binding as described in section 21.3.5. However, pureQuery can be extended so that individual nodes within an XML document are accessible to the Java application. This extension can be achieved in several ways and in the following we outline three possible methods.
Use pureQuery with the XMLTABLE Function

You can customize the generated SQL SELECT statements to contain an XMLTABLE function. The XMLTABLE function enables you to select specific XML nodes and return them as relational columns. Using this approach, the default ORM functionality of pureQuery can be leveraged. After updating the SQL statement to include the XMLTABLE function, the pureQuery code generator needs to be run again to generate a Java Bean that corresponds to the columns returned by the query. This approach minimizes data transfer between the database and the application because only selected XML node values are fetched.

Use pureQuery with a Data-Binding Framework Such as JAXB

pureQuery offers extension points that allow you to customize result set handling. Given the generated SQL query that selects an XML column, you can write an extension that uses an XML data binding library such as JAXB to transform XML data from a query result set into Java objects. First you need to use the JAXB library to generate Java classes based on the customer XML Schema. Then you implement a pureQuery RowHandler that uses JAXB methods to unmarshall an XML document from a result row into instances of the previously generated Java classes. Finally, you register the custom RowHandler class with pureQuery by passing it as an argument to the method that executes the SQL query. This approach relieves you from defining the mapping in XMLTABLE functions and encapsulates the mapping in the RowHandler class. The drawback of this approach is that complete documents are transferred from the database to the client application, even though the client might only be interested in certain XML elements.

Use pureQuery with an Application-Level XML Parser

The third approach places the XML to Java mapping into the Java Bean.
You can add fields and corresponding getter and setter methods to the bean to return XML element and attribute values of interest, such as customer name or customer phone. These new methods need to use an XML parser, such as DOM, to parse and extract values from the default info column. However, similar to the JAXB approach, it involves sending the complete XML document from the database to the client. This can have a negative effect on performance due to XML serialization on the DB2 side and extraneous parsing at the client side. This XML parsing is avoided in the first approach with the XMLTABLE function that exploits DB2’s parsed XML storage format.

Appendix C contains pointers to more detailed resources covering application development with pureQuery.

21.4 .NET APPLICATIONS

If you want to access XML or relational data in a DB2 database from your .NET application, you need an ADO.NET data provider. IBM provides three data providers for .NET applications:
• DB2 .NET Data Provider
• OLE DB .NET Data Provider
• ODBC .NET Data Provider

All three providers are installed as part of the DB2 Application Development Client. Among them, the DB2 .NET Data Provider is the recommended data provider for use with DB2 family databases. It provides access to databases in DB2 for Linux, UNIX, and Windows and DB2 for z/OS. It has a more extensive set of APIs, fewer restrictions, and provides better performance than the OLE DB and ODBC .NET Data Providers. It is a “managed” provider, which means that it runs entirely within the Common Language Runtime (CLR), and does not translate requests to native OLE or ODBC APIs.

The DB2 .NET Data Provider classes are located in the .NET namespaces IBM.Data.DB2 and IBM.Data.DB2Types. One class of particular interest in the namespace IBM.Data.DB2Types is the class DB2Xml, which represents XML data from a DB2 database.

21.4.1 Querying XML Data in .NET Applications

Let’s start with a simple example. The sample code in Figure 21.11 shows how to retrieve XML data using the class DB2Xml. The code executes an SQL/XML query against the customer table and retrieves the addr fragment from the XML document. The method DB2Command.ExecuteReader() executes the query. The ExecuteReader() method is used whenever a result set is expected, as in this case. The result set class is DB2DataReader. The DB2DataReader class allows you to loop through the result set and access the columns of the current row based on the column index and the column data type. The method DB2DataReader.GetDB2Xml() retrieves data from the XML column in the result set. This method returns an object of type DB2Xml.
using System;
using IBM.Data.DB2;
using IBM.Data.DB2Types;
…
public static void readXmlColumn()
{
    try
    {
        // DB2Connection refers to an open connection object here
        DB2Command cmd = DB2Connection.CreateCommand();
        cmd.CommandText = "SELECT XMLQUERY('$INFO/customerinfo/addr') " +
                          "FROM customer WHERE cid = 1003";
        DB2DataReader reader = cmd.ExecuteReader();
        DB2Xml doc;
        String xmlString;
        while (reader.Read())
        {
            doc = reader.GetDB2Xml(0);   // 0: index for first column
            xmlString = doc.GetString();
            Console.WriteLine(xmlString);
        }
        reader.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}

Figure 21.11 Retrieving XML data in a C# .NET application

The class DB2Xml offers several different methods for retrieving XML data. The sample code in Figure 21.11 uses the method GetString(). Table 21.8 lists all available methods on the DB2Xml class.

Table 21.8 Methods of the Class DB2Xml

IBM.Data.DB2Types.DB2Xml
GetBytes()        Returns the contents of the DB2Xml object instance as a UTF-8 byte array
GetString()       Returns the contents of the DB2Xml object instance as a UTF-16 string
GetXmlReader()    Returns an XmlReader for the contents of the DB2Xml object instance

More details on XmlReader follow. The use of the class DB2Xml is recommended but not mandatory to retrieve XML data into a .NET application. For example, you can also use the methods GetString(), GetBytes(), and GetXmlReader() directly on the DB2DataReader class to obtain data from an XML column. However, using the class DB2Xml is preferred because it can provide performance optimizations and in future DB2 versions it might be enhanced with advanced methods for handling XML data.

Queries in XQuery notation can be executed in .NET applications just like SQL statements, but must be prepended with the keyword xquery. For example, the CommandText property in Figure 21.11 could just as well be assigned a FLWOR expression instead of an SQL statement, like this:

cmd.CommandText = "xquery for $i in db2-fn:xmlcolumn(\"CUSTOMER.INFO\") " +
                  "where $i/customerinfo/@Cid = 1003 " +
                  "return $i/customerinfo/addr";

21.4.2 Manipulating XML Data in .NET Applications

Note that the query in Figure 21.11 retrieves only the addr piece of an XML document and not the whole document.
Since DB2 stores XML in a parsed format, the query can return fragments or individual values from the XML document without additional XML parsing. This query provides a significant performance benefit compared to reading the whole document into your
application code and extracting the desired values there. Use DB2’s pureXML query capabilities to avoid costly XML manipulation in your application as much as possible.

If you do need to manipulate XML documents in your application code, you can use the .NET classes XmlReader or XmlDocument. The XmlDocument class implements the XML Document Object Model (DOM) to provide an in-memory tree representation of an XML document. It enables the navigation and editing of a document. The XmlReader class provides event-based, forward-only, and read-only access to XML data, which is similar to a Java StAX parser (see section 21.3).

You can obtain an XmlReader object from an XML column in a result set as illustrated by the code snippet in Figure 21.12. The while loop iterates over the result set and obtains an XmlReader object for each XML document produced by the query. You can then operate on each XmlReader object as you normally would.

DB2Command cmd = DB2Connection.CreateCommand();
cmd.CommandText = "SELECT info FROM customer " +
                  "WHERE XMLEXISTS('$INFO/customerinfo/addr[city = \"Toronto\"]')";
DB2DataReader reader = cmd.ExecuteReader();
DB2Xml doc;
XmlReader myXMLReaderObject;
while (reader.Read())
{
    doc = reader.GetDB2Xml(0);   // 0: index for first column
    myXMLReaderObject = doc.GetXmlReader();
    // do something with myXMLReaderObject
    myXMLReaderObject.Close();
}
reader.Close();

Figure 21.12 Obtaining an XmlReader from a DB2Xml object

If you execute a query that returns at most one XML document then you can obtain an XmlReader object with the shortcut shown in Figure 21.13. This code sample calls the method ExecuteXmlReader() directly on the DB2Command object. The method ExecuteXmlReader() does not contain a parameter for the column index. Instead, it assumes that the command returns a single row with a single column that contains an XML document.
DB2Command cmd = DB2Connection.CreateCommand();
cmd.CommandText = "SELECT info FROM customer WHERE cid=1003";
XmlReader myXMLReaderObject = cmd.ExecuteXmlReader();

Figure 21.13 Obtaining an XmlReader from a DB2Command object

If you prefer to manipulate an XML document as a DOM, you can populate an XmlDocument object from an XmlReader by calling the Load() method on an XmlDocument instance and passing it the reader, for example myXmlDocument.Load(myXMLReaderObject).
While the XmlDocument object provides read and write access to the XML document via the DOM API, an XPathDocument object provides fast read-only access. You can use the method DB2XmlAdapter.fillSQL() to read XML data from the database into an instance of an XPathDocument object. The DB2 Express-C Developer Handbook listed in Appendix C contains further examples.

21.4.3 Inserting XML Data from .NET Applications

Let’s turn to inserting XML data from a .NET application into DB2. The sample code in Figure 21.14 shows how to insert a row with an XML document into the info column of the customer table. In this example, the INSERT statement also extracts the XML attribute Cid from the XML document and inserts it into the relational column cid, so the customer ID number is extracted by DB2 at insert time. The sample code reads the XML document from a file, but could obtain it from any other source. You need to set the data type and value of the parameter before using the method cmd.Parameters.Add to bind the parameter to the INSERT statement. The method DB2Command.ExecuteNonQuery() executes the INSERT statement. ExecuteNonQuery() is used when no result set is expected, as in this case.

using System;
using System.IO;
using IBM.Data.DB2;
using IBM.Data.DB2Types;
…
try
{
    DB2Command cmd = DB2Connection.CreateCommand();
    cmd.CommandText = "INSERT INTO customer(cid, info) " +
                      "SELECT X.cid, X.info " +
                      "FROM XMLTABLE ('$d' passing cast(? as XML) as \"d\" " +
                      "COLUMNS cid INTEGER PATH 'customerinfo/@Cid', " +
                      "info XML PATH '.' ) AS X";
    string XMLFile = "customer.xml";
    DB2Parameter p1 = cmd.CreateParameter();
    p1.DB2Type = DB2Type.XML;
    p1.Value = File.OpenRead(XMLFile);
    cmd.Parameters.Add(p1);
    cmd.Prepare();
    cmd.ExecuteNonQuery();
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}

Figure 21.14 Inserting XML data from a C# .NET application
The sample code in Figure 21.14 sets the value of the input parameter p1 to the result of reading from a file, which returns a FileStream. Alternatively, DB2Parameter.Value can also be assigned a value of type String, Byte[], XmlReader, or DB2Xml.

21.4.4 XML Schema and DTD Handling in .NET Applications

The DB2 .NET Data Provider includes a number of methods that allow you to manage XML Schemas and DTDs in DB2 from within your .NET application (Table 21.9). For example, there are methods to register an XML Schema or a DTD in the XML Schema Repository (XSR) of a DB2 database. Corresponding methods to drop objects from the XSR are also provided. Other methods allow you to obtain DB2’s internal identification number of an XML Schema (XSROBJECTID) and to retrieve the XML Schema documents for a given XSROBJECTID.

Table 21.9 .NET Methods for XML Schema and DTD Handling

DB2Connection.RegisterXmlSchema         Registers an XML Schema in the database
DB2Connection.DropXmlSchema             Drops an XML Schema in the database
DB2DataReader.GetDB2XsrObjectId         Creates an instance of a DB2XsrObjectId from XML column data
DB2DataReader.GetXmlSchemaCollection    Returns an XmlSchemaCollection object of all the schema documents for the given DB2XsrObjectId
DB2Connection.RegisterDTD               Registers a DTD in the database
DB2Connection.DropDTD                   Drops a DTD in the database

When you develop .NET applications you might also be interested in the DB2 Add-on for Visual Studio .NET. This add-on is a collection of GUI tools that assist you with database and XML-related tasks. For more details, refer to section 21.9.2.

21.5 CLI APPLICATIONS

The DB2 Call Level Interface (CLI) is a callable SQL interface to a DB2 database server. CLI is an application programming interface for C and C++ applications and an alternative to embedded dynamic SQL. Unlike embedded SQL, CLI does not require host variables or a precompiler.
CLI applications can retrieve, insert, and update XML data using the CLI SQL type SQL_XML. This data type corresponds to the XML type in DB2 that is used to define columns in tables or input and output parameters for stored procedures and user-defined functions. In your CLI application you can bind the SQL_XML type to the binary C type SQL_C_BINARY or to the character types SQL_C_CHAR, SQL_C_WCHAR, and SQL_C_DBCHAR. We recommend that you use the
C type SQL_C_BINARY rather than a character data type to avoid code page conversion. Code page conversion incurs overhead and can lead to the loss of data when character code pages are not fully compatible. Chapter 20, Understanding XML Data Encoding, provides more details on this issue.

Figure 21.15 shows a CLI code fragment to insert a row into the customer table. The INSERT statement contains two parameters, one for the relational column cid and one for the XML column info. The CLI function SQLBindParameter() binds the application variable custid to the first parameter marker, and the character buffer xmldoc to the second. Note that the second call to SQLBindParameter(), which binds the XML document to a parameter marker, specifies the C type SQL_C_BINARY and the SQL type SQL_XML.

char xmldoc[32000];
SQLINTEGER length;
SQLSMALLINT custid = 1099;

// assume that "xmldoc" contains the input document
length = strlen(xmldoc);

// allocate a statement handle:
SQLHANDLE stmt;
SQLAllocHandle(SQL_HANDLE_STMT, connection, &stmt);

// prepare the insert statement:
SQLPrepare(stmt, (SQLCHAR *) "INSERT INTO customer(cid, info) VALUES(?,?)", SQL_NTS);

// bind parameter values and execute the statement:
SQLBindParameter(stmt, 1, SQL_PARAM_INPUT, SQL_C_SHORT, SQL_SMALLINT, 0, 0, &custid, 0, NULL);
SQLBindParameter(stmt, 2, SQL_PARAM_INPUT, SQL_C_BINARY, SQL_XML, 0, 0, xmldoc, 32000, &length);
SQLExecute(stmt);

Figure 21.15 CLI code fragment to insert an XML document

Figure 21.16 shows a CLI code fragment that issues an SQL/XML query with a parameter marker against the customer table. The query retrieves the addr element for the customers that live in a specific zip code. The function SQLBindParameter() binds the variable zipcode to the query parameter. The function SQLBindCol() binds the string buffer xmldoc to the XML column that the query returns. Remember that the function XMLQUERY always returns a column of type XML.
The target data type in the call to SQLBindCol() is specified as SQL_C_BINARY to ensure that the XML data is returned in UTF-8 and not converted to the application code page.
char xmldoc[32000];
SQLINTEGER length;
char zipcode[10];

// Prepare the query (stmt is a statement handle allocated as in Figure 21.15):
SQLPrepare(stmt, (SQLCHAR *)
    "SELECT XMLQUERY('$INFO/customerinfo/addr') "
    "FROM customer "
    "WHERE XMLEXISTS('$INFO/customerinfo/addr[pcode-zip = $z]' "
    "PASSING cast(? as VARCHAR(10)) as \"z\")",
    SQL_NTS);
SQLBindParameter(stmt, 1, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_CHAR, 10, 0, zipcode, 10, NULL);

// Now execute the query for zip code N9C 3T6
strcpy(zipcode, "N9C 3T6");
SQLExecute(stmt);

// Bind the returned XML column and fetch the first row;
// after the fetch, length holds the number of bytes returned:
SQLBindCol(stmt, 1, SQL_C_BINARY, xmldoc, sizeof(xmldoc), &length);
SQLFetch(stmt);

Figure 21.16 CLI code fragment to read an XML document fragment

The default behavior for a CLI application is that each XML value that is retrieved from the database is given an XML declaration with an encoding attribute. For example, the first row returned by the query in Figure 21.16 could look like this:

<?xml version="1.0" encoding="UTF-8" ?>
<addr country="Canada">
  <street>25 EastCreek</street>
  <city>Markham</city>
  <prov-state>Ontario</prov-state>
  <pcode-zip>N9C 3T6</pcode-zip>
</addr>

You can choose to omit the XML declaration in one of three ways:

• Use the function SQLSetStmtAttr() to set the statement attribute SQL_ATTR_XML_DECLARATION to 0, per statement.
• Use the function SQLSetConnectAttr() to set the connection attribute SQL_ATTR_XML_DECLARATION to 0, per connection. This attribute affects any statement handles allocated after the value is changed.
• For a specific database, set the CLI/ODBC configuration keyword XMLDeclaration in the db2cli.ini file to 0. This setting affects all DB2 CLI and ODBC applications that access the database.
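To see which of these settings is in effect for a given statement, an application can check whether a fetched buffer starts with an XML declaration. The following is a minimal sketch in plain C that uses only the standard library. The function name xml_declaration_length is hypothetical and is not part of DB2 CLI, and the sketch assumes the buffer holds single-byte-compatible data such as the UTF-8 returned for SQL_C_BINARY.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns the number of bytes occupied by a leading XML declaration
 * ("<?xml ... ?>") in buf, or 0 if buf does not start with one.
 * Hypothetical helper for illustration; not a DB2 CLI function.
 */
size_t xml_declaration_length(const char *buf, size_t len)
{
    static const char prefix[] = "<?xml";
    const size_t prefix_len = sizeof(prefix) - 1;
    size_t i;

    if (len < prefix_len + 2 || memcmp(buf, prefix, prefix_len) != 0)
        return 0;

    /* "<?xml" must be followed by white space; otherwise the buffer starts
       with some other processing instruction such as "<?xml-stylesheet". */
    if (buf[prefix_len] != ' ' && buf[prefix_len] != '\t' &&
        buf[prefix_len] != '\r' && buf[prefix_len] != '\n')
        return 0;

    /* Scan for the closing "?>" of the declaration. */
    for (i = prefix_len; i + 1 < len; i++) {
        if (buf[i] == '?' && buf[i + 1] == '>')
            return i + 2;
    }
    return 0; /* unterminated declaration */
}
```

Called with the buffer and length that SQLFetch() filled in, a result of 0 suggests that the declaration was omitted, while a nonzero result is the offset at which the XML content after the declaration begins.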
XML queries in XQuery notation can be issued and executed in CLI applications just like SQL statements. An XQuery must be prepended with the keyword xquery.

21.6 EMBEDDED SQL APPLICATIONS

Embedded SQL applications are different from CLI, .NET, PHP, or JDBC applications because they can use host variables and static SQL in addition to dynamic SQL with parameter markers. While dynamic SQL statements are compiled at application runtime, static SQL is compiled by a precompiler that reads the application source code and converts embedded SQL statements into DB2 API calls. The precompiler produces a modified version of your application source code as well as access plans that are stored as a package in the database.

Embedded SQL applications declare host variables in the application source code to exchange data with the DB2 server. For XML data, DB2 provides six new XML types for host variables in Assembler, C, C++, COBOL, and PL/1 applications. These XML types are based on LOB types for variables and files. The six XML types and their encoding properties are listed in Table 21.10. The XML host variable types are compatible with the XML data type in DB2. To avoid codepage conversion issues, XML AS BLOB is preferred over XML AS CLOB and XML AS DBCLOB.
Table 21.10 XML Types for Host Variables

Type Declaration                     Encoding
SQL TYPE IS XML AS CLOB              XML data that is encoded in the application codepage (externally encoded)
SQL TYPE IS XML AS BLOB              XML data that is internally encoded
SQL TYPE IS XML AS DBCLOB            XML data that is encoded in the application graphic codepage (externally encoded)
SQL TYPE IS XML AS CLOB_FILE         XML data in a file that is encoded in the application codepage (externally encoded)
SQL TYPE IS XML AS BLOB_FILE         XML data in a file that is internally encoded
SQL TYPE IS XML AS DBCLOB_FILE       XML data in a file that is encoded in the application graphic codepage (externally encoded)

Remember that XML data is stored in DB2 in the UTF-8 codepage. If an embedded SQL application retrieves XML data into a host variable of type XML AS CLOB, the XML data is converted from UTF-8 to the code page of the application. An XML declaration is added to the retrieved XML data and it contains an encoding attribute that indicates the application codepage. This behavior guarantees that external and internal encoding of the retrieved XML data are consistent. The internal and external encoding might be inconsistent if you retrieve XML data into a host variable of type CLOB, instead of XML AS CLOB.
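The internal encoding of a document is whatever its XML declaration states. As a rough illustration, the plain C sketch below reads the encoding attribute from a leading XML declaration so that an application could compare it with the code page it actually uses. The function name xml_internal_encoding is hypothetical, and the sketch assumes a NUL-terminated, single-byte-compatible buffer; it is not a substitute for a real XML parser.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copies the value of the encoding attribute of a leading XML declaration
 * into out (always NUL-terminated when out_size > 0).
 * Returns 1 if a complete encoding value was found and fits into out,
 * 0 otherwise. Hypothetical helper for illustration only.
 */
int xml_internal_encoding(const char *xml, char *out, size_t out_size)
{
    const char *end, *p;
    char quote;
    size_t n = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    if (strncmp(xml, "<?xml", 5) != 0)
        return 0;                      /* no XML declaration */
    end = strstr(xml, "?>");
    if (end == NULL)
        return 0;                      /* unterminated declaration */
    p = strstr(xml, "encoding");
    if (p == NULL || p > end)
        return 0;                      /* no encoding attribute in it */

    p += strlen("encoding");
    while (*p == ' ' || *p == '\t') p++;
    if (*p != '=')
        return 0;
    p++;
    while (*p == ' ' || *p == '\t') p++;
    if (*p != '"' && *p != '\'')
        return 0;
    quote = *p++;

    /* Copy characters up to the matching quote. */
    while (p < end && *p != quote && n + 1 < out_size)
        out[n++] = *p++;
    out[n] = '\0';
    return (p < end && *p == quote) ? 1 : 0;
}
```

For data retrieved into an XML AS CLOB host variable, the value found this way should name the application code page; for XML AS BLOB it should be UTF-8.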
When you retrieve XML data into a host variable of type XML AS BLOB, which is the preferred method, the XML data is not converted to the application codepage and remains in UTF-8. The retrieved XML data is internally encoded but not externally encoded, and includes an XML declaration where the encoding attribute indicates UTF-8:

<?xml version = "1.0" encoding = "UTF-8"?>

Irrespective of the type of the host variable, you can use the function XMLSERIALIZE to avoid the generation of an XML declaration. Let’s consider the queries in Figure 21.17 as an example. Both queries retrieve XML documents from the XML column info. The column produced by the first query is of type XML and can be bound to a host variable that has an XML type such as XML AS BLOB or XML AS CLOB. The second query serializes the XML documents explicitly to data type BLOB, which has two effects. First, no XML declaration is generated for the documents unless you add the optional keywords INCLUDING XMLDECLARATION to the XMLSERIALIZE function. Second, since the returned column type is not XML, it cannot be bound to a host variable that has an XML type such as XML AS BLOB. This column must be bound to a host variable of type BLOB.

SELECT info FROM customer;

SELECT XMLSERIALIZE(info as BLOB) FROM customer;

Figure 21.17 Using XMLSERIALIZE to avoid an XML declaration

There is no support for static XQuery. If you try to precompile an XQuery statement, DB2 produces an error. You can execute XQuery only if the XQuery expression is embedded in an SQL statement through SQL/XML functions such as XMLQUERY or XMLTABLE.

The following sections provide examples of manipulating XML data in COBOL, PL/1, and C applications with host variables and embedded SQL statements. Although these three languages differ in the syntax for host variable declarations, the SQL and SQL/XML statements shown are not host language specific.
We therefore present slightly different examples for COBOL, PL/1, and C to provide a larger overall set of samples.

21.6.1 COBOL Applications with Embedded SQL

Enterprise COBOL 4.1 for z/OS has enhanced its XML features. For example, it enables COBOL applications to parse and extract values from XML documents, populate COBOL data structures with values from XML documents, validate XML documents, and generate XML documents from application data. However, when you develop COBOL applications for use with DB2 pureXML, it is often better to exploit DB2’s XML capabilities than to manipulate XML in the application code (see section 21.1.1).
In a COBOL application, the declaration of a host variable with the name MYDOCUMENT, data type XML AS BLOB, and maximum size of 1MB, looks like this:

       01 MYDOCUMENT USAGE IS SQL TYPE IS XML AS BLOB(1M).

On z/OS, an XML host variable of type XML AS BLOB(1M) or XML AS CLOB(1M) is converted by the DB2 precompiler into the following variable in your application:

       01 MYDOCUMENT.
          02 MYDOCUMENT-LENGTH PIC 9(9) COMP.
          02 MYDOCUMENT-DATA.
             49 FILLER PIC X(32767).
             49 FILLER PIC X(32767).
             . . .
             49 FILLER PIC X(32).

The XML type variable is declared in chunks of 32,767 bytes or less. In this example, the variable of size 1MB is represented by 32 chunks of 32,767 bytes plus one chunk of 32 bytes (32 × 32,767 + 32 = 1,048,576 = 1MB). If the variable is of type XML AS DBCLOB, it is declared in chunks of 32,767 or fewer double-byte characters.

The sample COBOL code in Figure 21.18 shows declarations of XML host variables and their usage in INSERT, SELECT, and UPDATE statements. The INSERT statement contains the literal value 1006 for the CID column of the customer table, but it could also include a subselect with an XMLTABLE function to extract the Cid attribute value from the XML document.

      *** Host variable declarations
           EXEC SQL BEGIN DECLARE SECTION END-EXEC.
       01 MYDOC  USAGE IS SQL TYPE IS XML AS BLOB(50K).
       01 MYDOC2 USAGE IS SQL TYPE IS XML AS CLOB(1M).
       01 MYCLOB USAGE IS SQL TYPE IS CLOB(10K).
           EXEC SQL END DECLARE SECTION END-EXEC.

      *** Insert a document into an XML column.
           EXEC SQL
             INSERT INTO CUSTOMER(CID, INFO) VALUES (1006, :MYDOC)
           END-EXEC.

      *** Update an XML column with an XML AS CLOB variable
           EXEC SQL
             UPDATE CUSTOMER SET INFO = :MYDOC2
             WHERE XMLEXISTS('$i/customerinfo[@Cid = 1004]'
                             PASSING info as "i")
           END-EXEC.

      *** Retrieve the addr fragment of an XML document into
      *** an XML AS BLOB host variable. The XML data is in UTF-8
      *** and has an XML declaration with encoding attribute.
           EXEC SQL
             SELECT XMLQUERY('$i/customerinfo/addr' PASSING info as "i")
             INTO :MYDOC
             FROM CUSTOMER
             WHERE XMLEXISTS('$i/customerinfo[@Cid = 1003]'
                             PASSING info as "i")
           END-EXEC.

      *** Retrieve an XML document into a CLOB variable. The XML
      *** data is converted to the application code page and
      *** has no XML declaration.
           EXEC SQL
             SELECT XMLSERIALIZE(INFO AS CLOB(10K))
             INTO :MYCLOB
             FROM CUSTOMER
             WHERE XMLEXISTS('$i/customerinfo[@Cid = 1005]'
                             PASSING info as "i")
           END-EXEC.

Figure 21.18 COBOL code to read and write XML data

In many situations you might not want to retrieve full XML documents or document fragments into LOB variables. Instead it can be very useful to extract individual values from an XML document and read them into dedicated variables. The code sample in Figure 21.19 uses an SQL/XML statement with the XMLTABLE function to extract the values of the name, street, and city elements into corresponding host variables. The query also uses a host variable in the XMLEXISTS predicate to select information for one specific customer.

           EXEC SQL BEGIN DECLARE SECTION END-EXEC.
       01 name   pic x(30).
       01 street pic x(35).
       01 city   pic x(20).
       01 cid    pic s9(4) comp-5.
           EXEC SQL END DECLARE SECTION END-EXEC.

           MOVE 1005 TO cid.

           EXEC SQL
             SELECT X.custname, X.str, X.city
             INTO :name, :street, :city
             FROM CUSTOMER,
                  XMLTABLE('$i/customerinfo' PASSING info AS "i"
                    COLUMNS
                      custname VARCHAR(30) PATH 'name',
                      str      VARCHAR(35) PATH 'addr/street',
                      city     VARCHAR(20) PATH 'addr/city'
                  ) as X
             WHERE XMLEXISTS('$i/customerinfo[@Cid = $c]'
                             PASSING info AS "i",
                                     CAST(:cid AS INTEGER) AS "c")
           END-EXEC.

Figure 21.19 Extracting XML values into host variables in COBOL
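The chunk arithmetic that the precompiler applies to an XML host variable can be checked with a few lines of C. This is only a sketch of the calculation described in this section, not of the precompiler itself; the names chunk_layout and xml_hostvar_layout are hypothetical, and chunk_size is assumed to be nonzero.

```c
#include <stddef.h>

/* Layout of a LOB-based XML host variable split into fixed-size chunks. */
struct chunk_layout {
    size_t full_chunks;  /* number of chunks of exactly chunk_size units */
    size_t remainder;    /* size of the final, shorter chunk (0 if none) */
};

/*
 * Splits declared_size into chunks of chunk_size units.
 * chunk_size must be greater than 0 (for example 32767).
 */
struct chunk_layout xml_hostvar_layout(size_t declared_size, size_t chunk_size)
{
    struct chunk_layout layout;

    layout.full_chunks = declared_size / chunk_size;
    layout.remainder   = declared_size % chunk_size;
    return layout;
}
```

For the 1MB variable above, xml_hostvar_layout(1048576, 32767) yields 32 full chunks and a remainder of 32 bytes, which matches the FILLER entries in the generated declaration.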
21.6.2 PL/1 Applications with Embedded SQL

In a PL/1 application, the declaration of a host variable with the name MYDOCUMENT, data type XML AS BLOB, and maximum size of 100KB, looks like this:

DCL MYDOCUMENT SQL TYPE IS XML AS BLOB (100K);

The DB2 precompiler takes this declaration as input and generates the following variable in your application:

DCL 1 MYDOCUMENT,
      2 MYDOCUMENT_LENGTH BIN FIXED(31),
      2 MYDOCUMENT_DATA,
        3 MYDOCUMENT_DATA1 (3) CHAR(32767),
        3 MYDOCUMENT_DATA2 CHAR(4099);

The precompiler generates the same application variable if the host variable was declared as XML AS CLOB(100K). The XML type variable, which has a size of 100K in this example, is represented by an array of strings of length 32,767. Since 100K is not evenly divisible by 32,767, an additional character string is declared to hold the remainder. 100K is 102,400 bytes and allocated as three chunks of 32,767 bytes, totaling 98,301 bytes, plus a string of 4,099 bytes (98,301 + 4,099 = 102,400). If the variable is of type XML AS DBCLOB, the precompiler declares chunks of 16,383 double-bytes plus an additional character string for the remainder.

Figure 21.20 shows a PL/1 program that inserts data from an XML AS BLOB host variable into an XML column. It also shows the retrieval of XML data into an XML AS BLOB host variable and a CLOB host variable, respectively.

/* Host variable declarations */
EXEC SQL BEGIN DECLARE SECTION;
DCL MYDOC  SQL TYPE IS XML AS BLOB (100K),
    MYCLOB SQL TYPE IS CLOB(10K),
    CID    BIN FIXED(31);
EXEC SQL END DECLARE SECTION;

/* Insert a document into an XML column and extract the Cid */
/* attribute into the CID column of the table:              */
EXEC SQL
  INSERT INTO CUSTOMER(CID, INFO)
  SELECT id, doc
  FROM XMLTABLE('$i' PASSING CAST(:MYDOC AS XML) AS "i"
         COLUMNS
           id  INTEGER PATH 'customerinfo/@Cid',
           doc XML     PATH '.') as X;

/* Retrieve an XML document into an XML AS BLOB host variable. */
/* The XML data is in UTF-8 and has an XML declaration.        */
EXEC SQL
  SELECT CID, INFO INTO :CID, :MYDOC
  FROM CUSTOMER
  WHERE XMLEXISTS('$i/customerinfo[@Cid = 1003]'
                  PASSING info as "i");

/* Retrieve a piece of an XML document into a CLOB variable.   */
/* The XML data is converted to the application code page      */
/* and has no XML declaration.                                 */
EXEC SQL
  SELECT XMLSERIALIZE(XMLQUERY('$i/customerinfo/addr'
                               PASSING info as "i") AS CLOB(10K))
  INTO :MYCLOB
  FROM CUSTOMER
  WHERE XMLEXISTS('$i/customerinfo[@Cid = 1005]'
                  PASSING info as "i");

Figure 21.20 PL/1 code to insert and select XML data

If you do not want to retrieve full XML documents from DB2 you can use the XMLTABLE function to extract individual values from XML documents into host variables. This technique is illustrated by the code sample in Figure 21.21. The query shown also uses a host variable in the XMLTABLE function to select a specific XML document in the table. Alternatively, an XMLEXISTS predicate can be used.

EXEC SQL BEGIN DECLARE SECTION;
DCL cid   BIN FIXED(31),
    name  CHAR(30) VAR,
    phone CHAR(15) VAR,
    city  CHAR(20) VAR;
EXEC SQL END DECLARE SECTION;

/* assume phone holds the number for a specific customer */
EXEC SQL
  SELECT X.id, X.custname, X.city
  INTO :cid, :name, :city
  FROM CUSTOMER,
       XMLTABLE('$i/customerinfo[phone = $p]'
         PASSING info AS "i",
                 CAST(:phone AS VARCHAR(15)) AS "p"
         COLUMNS
           id       INTEGER     PATH '@Cid',
           custname VARCHAR(30) PATH 'name',
           city     VARCHAR(20) PATH 'addr/city'
       ) as X;

Figure 21.21 Extracting XML values into host variables in PL/1
21.6.3 C Applications with Embedded SQL

In a C application with embedded SQL, the declaration of a host variable with the name MYDOCUMENT, a data type of XML AS BLOB, and a maximum size of 1MB, looks like this:

SQL TYPE IS XML AS BLOB(1M) MYDOCUMENT;

An XML host variable of type XML AS BLOB(1M) or XML AS CLOB(1M) is converted by the DB2 precompiler into the following variable in your application:

struct
{
  unsigned long length;
  char data[1048576];
} MYDOCUMENT;

The code sample in Figure 21.22 illustrates the use of an XML AS CLOB host variable to insert XML data into an XML column. The XML data is externally encoded in the code page of the application because it is passed to DB2 in a character variable. When the INSERT statement is processed, the XML document is converted to the database codepage and then to UTF-8, which is the codepage for all XML storage. If the database codepage is not UTF-8, intermediate conversion to the database codepage can be avoided if you use XML AS BLOB instead of XML AS CLOB.

EXEC SQL BEGIN DECLARE SECTION;
  SQL TYPE IS XML AS CLOB(10K) mydoc;
  char docstring[5000];
EXEC SQL END DECLARE SECTION;

/* Create an XML document */
strcpy(docstring, "<customerinfo cid=\"1055\">"
                  "<name>John Doe</name>"
                  "<phone type=\"cell\">408-463-4963</phone>"
                  "</customerinfo>");

/* Set the data and length of the host variable mydoc */
strcpy(mydoc.data, docstring);
mydoc.length = strlen(docstring) + 1;

/* Insert the document */
EXEC SQL INSERT INTO customer(cid, info) VALUES (1101, :mydoc);

Figure 21.22 Inserting an XML document with embedded SQL in C

The code sample in Figure 21.23 uses a cursor and an SQL/XML statement to retrieve the phone elements where the type is work for all customers who live in Berlin. Each phone element is returned in UTF-8 format with its start tag and end tag.
If you want to obtain the phone number in a CHAR variable without the XML tags, declare the host variable myphone as CHAR and change the query to return a VARCHAR column using the XMLTABLE or XMLCAST function.
EXEC SQL INCLUDE SQLCA;
EXEC SQL BEGIN DECLARE SECTION;
  SQL TYPE IS XML AS BLOB(1K) myphone;
  char city[20];
  char phonetype[10];
EXEC SQL END DECLARE SECTION;

strcpy(city, "Berlin");
strcpy(phonetype, "work");

/* Declare a cursor for a SQL/XML query */
EXEC SQL DECLARE cur1 CURSOR FOR
  SELECT XMLQUERY('$i/customerinfo/phone[@type = $t]'
                  PASSING info as "i",
                          CAST(:phonetype AS VARCHAR(15)) as "t")
  FROM customer
  WHERE XMLEXISTS('$i/customerinfo[addr/city = $c]'
                  PASSING info as "i",
                          CAST(:city AS VARCHAR(15)) as "c");

/* Open the cursor and fetch all rows */
EXEC SQL OPEN cur1;
while (sqlca.sqlcode == SQL_RC_OK)
{
  EXEC SQL FETCH cur1 INTO :myphone;
  if (sqlca.sqlcode != SQL_RC_OK) break;  /* no more rows, or an error */
  /* Consume and process the fetched phone element here */
}
EXEC SQL CLOSE cur1;

Figure 21.23 Retrieving XML elements with embedded SQL in C

When you develop a C application with embedded SQL for use with a database in DB2 for Linux, UNIX, and Windows, you can also execute queries in XQuery notation without SQL. You execute XQuery dynamically, not statically, as demonstrated in Figure 21.24. The statement string must begin with the keyword xquery. You can then prepare and execute the query like a dynamic SQL statement.

EXEC SQL INCLUDE SQLCA;
EXEC SQL BEGIN DECLARE SECTION;
  char stmt[2000];
  SQL TYPE IS XML AS BLOB(10K) mydoc;
EXEC SQL END DECLARE SECTION;

sprintf(stmt, "xquery for $i in db2-fn:xmlcolumn(\"CUSTOMER.INFO\") "
              "where $i/customerinfo/addr[city = \"Aurora\"] "
              "return <cust>{$i/customerinfo/name}"
              "{$i/customerinfo/phone}</cust>");

EXEC SQL PREPARE st1 FROM :stmt;
EXEC SQL DECLARE cur1 CURSOR FOR st1;
EXEC SQL OPEN cur1;
while (sqlca.sqlcode != 100)
{
  EXEC SQL FETCH cur1 INTO :mydoc;
  /* 100: no more rows; negative: error */
  if (sqlca.sqlcode == 100 || sqlca.sqlcode < 0) break;
  /* Display results */
}
EXEC SQL CLOSE cur1;

Figure 21.24 Executing XQuery in a C application with embedded SQL

21.7 PHP APPLICATIONS

PHP is an interpreted programming language that has gained increasing popularity for the development of Web applications. PHP is a modular language that allows for extensions to provide additional or customized functionality in the language. For example, PHP 5 includes new extensions for processing XML data such as SimpleXML, XMLReader, and XMLWriter. SimpleXML lets you convert an XML document into an object that can be processed with normal property selectors and array iterators. Other extensions for PHP facilitate read and write access to databases so that you can easily create a dynamic database-driven Web application.

PHP is a programming language not only for distributed platforms. IBM also offers a port of PHP 5.1 to the z/OS UNIX System Services platform (see Appendix C for the URL).

IBM offers two PHP extensions for database access, called ibm_db2 and pdo_ibm. You can use either extension to access data in a DB2 family database from your PHP application. Both extensions are included as part of the IBM Data Server Client but can also be downloaded from the PHP Extension Community Library (PECL) at http://pecl.php.net/. The extension pdo_ibm is a driver for PHP Data Objects (PDO) and offers access to DB2 databases through the standard object-oriented database interface introduced in PHP 5.1. The extension ibm_db2 offers a procedural application programming interface (API) for database operations such as CREATE, INSERT, SELECT, and UPDATE and also provides access to the database metadata. The complete list of all DB2 PHP functions in this extension is documented at http://www.php.net/manual/en/ref.ibm-db2.php, which is an excellent reference if you develop PHP applications for DB2.
You can compile the ibm_db2 extension with either PHP 4 or PHP 5. The following examples all use the ibm_db2 extension. Figure 21.25 shows the code to insert an XML document into the info column of the customer table. The customer table also contains an INTEGER column called cid. The value for this column is extracted from the XML document by DB2 as part of the INSERT statement. The application does not need to parse the XML document to extract this value before performing the insert. In this example, the XML document to insert is retrieved from a file and assigned to the variable $mydoc. The function db2_bind_param() binds the document to the parameter marker of the prepared INSERT statement. Note that the third parameter of the function db2_bind_param() is a variable name “mydoc” as a string literal rather than the variable $mydoc itself. After you
have called db2_prepare() once, you can call db2_bind_param() and db2_execute() repeatedly to insert multiple documents. The function db2_execute() always returns either true or false to indicate the success or failure of the statement execution.

    // Read the XML document from the file into a variable:
    $mydoc = file_get_contents("customer.xml");

    // Create a string that holds the INSERT statement.
    // \$ and \" are escaped so that PHP neither interpolates $d
    // nor ends the string at the SQL identifier "d":
    $insert = "INSERT INTO customer
               SELECT T.cid, T.info
               FROM XMLTABLE('\$d' PASSING CAST(? AS XML) AS \"d\"
                             COLUMNS cid  INTEGER PATH 'customerinfo/@Cid',
                                     info XML     PATH '.') AS T";

    // Create a prepared statement:
    $stmt = db2_prepare($connection, $insert);

    // Bind the XML document to the first parameter marker:
    db2_bind_param($stmt, 1, "mydoc", DB2_PARAM_IN);

    // Execute the statement:
    $success = db2_execute($stmt);
    if ($success) {
      print "New customer inserted.";
    }

Figure 21.25 PHP code to insert an XML document

Figure 21.26 shows the code to execute an XQuery against the customer table. The query retrieves the complete XML document for each customer who lives in Aurora. Since XQuery does not use SQL-style parameter markers, there is no benefit in preparing an XQuery statement. The query string is executed directly with the db2_exec() function, which returns a statement resource if the execution was successful. The function db2_fetch_array() returns each result row as an array indexed by column position. Since XQuery always returns a single column, you only need to access the first element of the array at index 0. By default, the result cursor is a forward-only cursor that returns the next row of the result set for each fetch call.

    // Build a query string (\$i is escaped so PHP does not interpolate it):
    $query = "xquery for \$i in db2-fn:xmlcolumn('CUSTOMER.INFO')
              where \$i/customerinfo/addr/city = 'Aurora'
              return \$i";

    // Execute the XQuery:
    $stmt = db2_exec($connection, $query);

Figure 21.26 PHP code to execute an XQuery
    // Loop through the result set:
    while ($row = db2_fetch_array($stmt)) {
      print $row[0] . "\n";
    }

Figure 21.26 PHP code to execute an XQuery (Continued)

If you prefer to retrieve only particular information for a customer instead of the complete XML document, let DB2 extract the values for you and avoid costly XML parsing in your PHP code. In Figure 21.27, an XMLTABLE function is used to extract name, street, and city information for the customer Jim Noodle. Note that the XMLTABLE function contains a predicate and a parameter marker for the customer name that selects a specific document. The parameter marker allows you to prepare the query just once but execute it many times, each time with a different customer name as input. Preparing the query only once avoids repeated compilation at the DB2 server and saves CPU cycles. In this example the result is fetched with the function db2_fetch_object(). While db2_fetch_array() returns each result row as an array, db2_fetch_object() allows you to access the columns of the result set as properties of a result row object.

    // Build a query string (\$ and \" are escaped for PHP):
    $query = "SELECT T.custname, T.street, T.city
              FROM customer,
                   XMLTABLE('\$INFO/customerinfo[name = \$n]'
                            PASSING CAST(? AS VARCHAR(25)) AS \"n\"
                            COLUMNS custname VARCHAR(20) PATH 'name',
                                    street   VARCHAR(20) PATH 'addr/street',
                                    city     VARCHAR(16) PATH 'addr/city') AS T";

    // Create a prepared statement:
    $stmt = db2_prepare($connection, $query);

    // Bind a value to the parameter marker:
    $searchname = "Jim Noodle";
    db2_bind_param($stmt, 1, "searchname", DB2_PARAM_IN);

    // Execute the query:
    $success = db2_execute($stmt);

    // Loop through the result set:
    if ($success) {
      while ($row = db2_fetch_object($stmt)) {
        printf("%s, %s, %s\n", $row->custname, $row->street, $row->city);
      }
    }

Figure 21.27 PHP code to extract selected elements from a document
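For contrast, the following Python sketch shows the kind of client-side work that the XMLTABLE query in Figure 21.27 saves. It is only an illustration under stated assumptions: the sample document and the function name extract_customer are hypothetical, and the document shape is inferred from the paths used in the figure (name, addr/street, addr/city). Every fetched document has to be parsed in full before three small values can be read.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer document; its shape is assumed from the
# paths used in Figure 21.27 (name, addr/street, addr/city).
SAMPLE_DOC = """<customerinfo Cid="1004">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>"""


def extract_customer(doc: str) -> dict:
    """Parse the whole document, then read three values by path."""
    root = ET.fromstring(doc)  # full parse, repeated for every document
    return {
        "custname": root.findtext("name"),
        "street": root.findtext("addr/street"),
        "city": root.findtext("addr/city"),
    }


if __name__ == "__main__":
    print(extract_customer(SAMPLE_DOC))
```

With XMLTABLE the application receives the three values as ordinary result columns, so code of this kind is not needed on the client.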
21.8 PERL APPLICATIONS

The DB2 Perl driver (called DBD::DB2) allows you to query and manipulate XML data in a DB2 database. Use DBD::DB2 version 1.6 or higher. For example, you can insert XML documents from a Perl application into a column of type XML. You can also send XQuery or SQL/XML queries to DB2 and retrieve the XML data in the result set either as a BLOB or a Record. Note that the DBD::DB2 driver supports only dynamic SQL, not static SQL. For information about the DB2 Perl Database Interface and information on how to download the latest DBD::DB2 driver, go to http://search.cpan.org/~ibmtordb2/.

The Perl code in Figure 21.28 connects to the database, prepares two queries, and fetches their results as a Record and a BLOB, respectively. The example assumes that a DB2 database with the name perldb contains the customer table.

    #!/usr/bin/perl
    use DBI;

    my $dbname   = 'dbi:DB2:perldb';
    my $dbuser   = '';
    my $password = '';
    my $dbhandle = DBI->connect($dbname, $dbuser, $password)
        or die "Connection failed: $DBI::errstr";

    ### Statement 1: SQL/XML query to extract customer names
    ### for a given zip code:
    $stmt1 = q(
      SELECT XMLQUERY('$INFO/customerinfo/name')
      FROM customer
      WHERE XMLEXISTS('$INFO/customerinfo/addr[zip = $z]'
                      PASSING CAST(? AS VARCHAR(10)) AS "z")
    );

    ### Statement 2: XQuery to retrieve the address
    ### for a certain customer:
    $stmt2 = q(
      xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
      where $i/name = "Matt Foreman"
      return $i/addr
    );

    ### Prepare and execute Statement 1, fetch result as a record:
    my $zipcode = '95141';
    $stmthandle = $dbhandle->prepare($stmt1);
    $stmthandle->bind_param(1, $zipcode);
    $stmthandle->execute();

    # Associate a variable with the output column
    $stmthandle->bind_col(1, \$custname);
    while ($stmthandle->fetch) {
      print $custname;
      print "\n";
    }

    ### Prepare and execute Statement 2, and fetch result as a BLOB
    $dbhandle->{LongReadLen} = 0;
    $stmthandle = $dbhandle->prepare($stmt2);
    $stmthandle->execute();
    my $offset = 0;
    my $buffer = "";

    # Retrieve the results and print on the screen
    while ($stmthandle->fetch()) {
      $offset = 0;  # start at the beginning of each row's BLOB
      while ($buffer = $stmthandle->blob_read(1, $offset, 1000000)) {
        print "\n";
        print $buffer;
        $offset += length($buffer);
        $buffer = "";
      }
    }

Figure 21.28 Querying XML data in a Perl application

For XML document manipulation, such as extracting values or modifying documents, it is recommended that you use DB2’s XQuery and SQL/XML capabilities as much as possible. If you do need to perform additional XML manipulation in your Perl application, consider using one of the popular XML modules, such as XML::Simple, XML::LibXML, XML::SAX, and XML::DOM. These and others can be found on the CPAN website.

21.9 XML APPLICATION DEVELOPMENT TOOLS

A variety of tools from various vendors exist to help you develop XML applications. You should use your favorite IDE (Integrated Development Environment) for the programming language that you are using, such as Eclipse for Java development, or Visual Studio for .NET development. Some of these IDEs have XML-specific capabilities or allow you to design and test stored procedures. In some cases you might need additional tools for the design and manipulation of XML artifacts such as XML Schemas, XML documents, and XSLT style sheets.

This section provides an overview of the following XML and application development tools that support DB2 pureXML particularly well:
• IBM Data Studio Developer
• IBM Database Add-ins for Visual Studio
• Altova XML Tools (XMLSpy, MapForce, StyleVision, and DatabaseSpy)
• <oXygen/>
• Stylus Studio
21.9.1 IBM Data Studio Developer

IBM Data Studio Developer provides a suite of integrated tools for database administrators and application developers. Data Studio Developer is based on the Eclipse framework and supports a variety of databases, including DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Using Data Studio Developer, you can develop and test SQL, XQuery, and SQL/XML queries as well as SQL and Java stored procedures. You can also generate and deploy data-centric Web services and develop and optimize Java applications. Data Studio Developer also supports database administration tasks, such as creating and altering database objects, managing privileges, and generating DDL statements. A profiler for SQL stored procedures allows you to get detailed performance information for each statement in a stored procedure, including elapsed time, CPU time, and the number of logical and physical I/O operations performed.

IBM Data Studio Developer also supports the design and manipulation of XML artifacts, such as XML columns, XML documents, XML Schemas, XSLT style sheets, and XML queries and updates. Figure 21.29 shows a screenshot of Data Studio Developer with the Data Project Explorer and Database Explorer on the left, the Editor in the top center, the Outline View on the right, and the Output and Properties panel below the Editor and Outline View.

The Database Explorer shows information about the database SAMPXML. There are folders for buffer pools, schemas, and other types of database objects. The Schemas folder is opened and allows access to the database objects for the database schema MNICOLA. One of the tables in this schema, CUSTOMER, has been opened to view sample content. The Output panel shows the three columns of the table, CID, INFO, and HISTORY. The columns INFO and HISTORY are XML columns and each of their rows carries a little button with three dots. 
You can click this button to show the XML document from that row and column in the Editor and Outline view above. Then you can view or modify the XML document and save it back into the database. Note that the editor for XML documents offers a Design and Source view. Figure 21.29 shows the Source view. Figure 21.30 illustrates the Design view of the XML Editor. The Design view shows the nested structure of the XML elements and attributes together with their values. A right-click on any node in the document structure opens a context menu with applicable actions. The context menu is being used to add another phone element before the existing phone element.
Figure 21.29 Browsing and viewing XML columns in Data Studio Developer

Data Studio Developer allows you to design new XML Schemas or edit existing schemas that you import from the file system or open from DB2’s XML Schema Repository. The center of Figure 21.31 shows the Source View of the XML Schema Editor. The editor uses syntax highlighting to improve the readability of the XML Schema. Simultaneously, the Outline view shows the structure of the XML Schema in a tree format together with data types, occurrence indicators, and other details of the items in the schema. You can edit the XML Schema in the Design or Source View of the editor, as well as in the Outline view on the right. A context menu is being used to change the number of allowed occurrences of the element pcode-zip. If you design or modify an XML Schema you can subsequently register it in the XML Schema Repository of a DB2 database.
Figure 21.30 XML document viewer and editor in Data Studio Developer

Data Studio Developer also supports the creation of XML queries and XML updates. Figure 21.32 shows the SQL and XQuery editor. If you press CTRL-SPACE, the content assist allows you to choose from a list of query and function templates. At the top of the editor a template of an XMLTABLE query has already been inserted. It provides a skeleton that you can fill out with actual table names, column names, and path expressions. It ensures that all required clauses and keywords are in place. Such templates exist for SQL/XML and XQuery as well as for XQuery Update expressions.
Figure 21.31 XML Schema editor in Data Studio Developer

Figure 21.32 Query editor with content assist for SQL/XML and XQuery
21.9.2 IBM Database Add-ins for Visual Studio

If your primary environment for application development is Microsoft Visual Studio 2005 or Visual Studio 2008, we recommend that you use the IBM Database Add-ins for Visual Studio. These add-ins are a collection of features and wizards that integrate into your Visual Studio development environment. They simplify your work with a DB2 database and the design of DB2 tables, indexes, schemas, queries, and applications. When you install the DB2 Server or the DB2 client, you will be presented with an option to install the IBM Database Add-ins for Visual Studio. Alternatively you can download them at http://www.ibm.com/software/data/db2/windows/dotnet.html.

The IBM Database Add-ins for Visual Studio provide the following capabilities for developing XML applications for DB2:
• Support of the XML data type in the Visual Studio table and stored procedure designer
• XML data visualization using Visual Studio’s built-in XML document designer and editor
• A wizard to create XML indexes, with automatic generation of namespace declarations
• XQuery and SQL/XML query designer with syntax colorization and auto-completion
• Integration of the DB2 XML Schema Repository and the .NET XML Schema editor
• Register, edit, and drop XML Schemas in the DB2 XML Schema Repository
• Validation of XML documents with XML Schemas
• Compare and “diff” XML Schemas
• Visual design of annotated XML Schemas with a mapping editor
• Generation of sample XML documents from an XML Schema
• XML data transformation using XSLT
• XML data import and export

21.9.3 Altova XML Tools

Altova is one of the leading vendors for XML design, editing, and mapping tools. Their flagship tools XMLSpy, MapForce, StyleVision, and DatabaseSpy have been enhanced so that relational as well as XML data and schemas in DB2 can be manipulated. Altova tools support working with DB2 for z/OS and DB2 for Linux, UNIX, and Windows. 
The tools connect to a DB2 database and allow you to add, browse, edit, update, or convert XML and relational data. Any potential error messages produced by DB2 are retrieved and displayed in the message pane of Altova’s GUI, so you can take corrective action.
XMLSpy

XMLSpy is one of the most popular XML editors and offers a full IDE for working with all XML-related technologies, including XML instance documents, XML Schemas, XQuery, XSLT, XPath, and more. XMLSpy’s deep integration with DB2 pureXML allows you to
• Edit, debug, and profile XQuery statements against XML data in DB2 databases. Query results are then available for further manipulation in XMLSpy.
• Visualize the database structure and query DB2 tables using SQL, SQL/XML, and XQuery statements.
• Read XML data from DB2, edit it, and store it back in DB2 with optional schema validation.
• Manage XML Schemas in DB2’s XML Schema Repository. For example, you can design new schemas in XMLSpy and register them in DB2, or read existing XML Schemas from DB2, edit them, and save them back into DB2.
• Transform XML data for use in other applications.

MapForce

MapForce is a graphical data mapping and conversion tool to define and maintain relationships between XML, databases, flat files, EDI, and web services. These capabilities help you integrate these artifacts in a service-oriented architecture (SOA) or custom data integration application. MapForce allows DB2 users to
• Map XML data, flat files, EDI, and so on directly to and from DB2 databases by assigning an XML Schema to the data using a drag-and-drop interface.
• Access, preview, and integrate database data.
• Define and deploy filtering of database sources within data mapping projects.
• Graphically build web services that retrieve or write data in DB2 databases.

StyleVision

StyleVision is a visual stylesheet creation tool that allows you to render XML and relational data as HTML, PDF, Word, RTF, OOXML, and electronic forms. DB2 users can use StyleVision to
• Create XSLT and XSL:FO stylesheets to publish XML and relational data using drag-and-drop interfaces and other features. 
• Produce multiple output documents in HTML, Word/RTF, and PDF for publishing and exchanging data from a DB2 pureXML database.
DatabaseSpy

DatabaseSpy is a database query and design tool. It allows you to
• Graphically view and modify database tables and their relationships as well as XML Schemas.
• Manage XML Schemas that are registered in the database.
• Write SQL queries with code completion and syntax coloring.
• Organize frequently used queries into project files.

For further information on these tools and their capabilities for DB2 pureXML, please refer to the following information:
• Website: Altova Tools for DB2 pureXML, http://www.altova.com/IBM_DB2_9_pureXML.html, http://www.altova.com/features_db2.html
• White Paper: Integration of Altova Tools with IBM DB2 pureXML, http://www.altova.com/whitepapers/ibm.pdf
• Tutorial: Using the Altova Tools with IBM DB2 pureXML, http://www.ibm.com/developerworks/db2/library/long/dm-0712kogan/

21.9.4 <oXygen/>

<oXygen/> is a complete XML editor with tools for XML authoring, XML conversion, XML Schema and DTD manipulation, Relax NG and Schematron development, as well as SOAP and WSDL testing. Additionally, <oXygen/> allows you to develop and debug XPath, XSLT, and XQuery. <oXygen/> can connect to XML repositories via WebDAV, Subversion (SVN), and FTP interfaces, and has support to browse and query databases. The <oXygen/> suite of XML tools is available as a stand-alone application and as a plug-in for Eclipse.

<oXygen/> offers support for DB2 pureXML that allows you to
• Register and manage XML Schemas in the XML Schema Repository of a DB2 database.
• Export XML or relational data from DB2 tables to XML output format.
• Add, delete, or edit data in DB2 tables. If any database constraints are violated, proper error messages allow you to correct the problem.
• Open XML documents from DB2 XML columns in the <oXygen/> XML editor, modify the XML data, and save it back to the database.
• Validate existing XML data in DB2 against an XML Schema. 
• Run SQL (including DDLs), SQL/XML, and XQuery against data in DB2 tables.
For further information and instructions to configure the DB2 support in <oXygen/>, see http://www.oxygenxml.com/IBM_DB2_XML_support.html.

21.9.5 Stylus Studio

Stylus Studio is a complete development environment for working with XML documents, XPath and XQuery, XSLT, XSL:FO, EDI, XML Schemas and DTDs, XHTML, XML mapping and publishing, and Web services. It offers functionality to assist with the design, debugging, and maintenance of these artifacts that are commonly used in and around XML applications. Additionally, Stylus Studio supports a separate XQuery processor, called DataDirect (http://www.xquery.com/), which allows you to query relational data in DB2 using the XQuery language. DataDirect converts XQuery into SQL statements that are then executed against relational tables in DB2. Depending on your application you might find it more natural to query relational DB2 data directly in SQL, and use DB2’s XQuery and SQL/XML capabilities to query XML data in DB2 databases.

Stylus Studio also allows you to use XML and relational data in DB2 as input for publishing and reporting with XSLT and XSL:FO. For further details on Stylus Studio, see http://www.stylusstudio.com/ibm_db2.html.

21.10 SUMMARY

You can use a wide range of programming languages and APIs to develop applications on top of DB2 pureXML. Many popular languages are supported, such as Java, COBOL, C, PL/I, PHP, Perl, or the .NET languages C# and Visual Basic. The newer versions of the database APIs for these languages support the XML data type as well as XQuery and SQL/XML statements. In most languages you can define application variables of type XML that simplify the data exchange between the application code and XML columns in DB2.

XML documents can also be inserted or retrieved with character or binary application variables. However, beware of code page issues when you use character type variables in your application. In DB2, all XML data is stored in UTF-8 format. 
If you retrieve XML documents into binary application variables, the UTF-8 encoding is preserved. If you retrieve XML documents into character variables, the documents are converted to the code page of the application. Not every query on an XML column returns XML data. For example, SQL/XML queries that use the XMLTABLE function can extract individual values from XML documents and convert them to traditional SQL data types such as INTEGER, DECIMAL, DATE, or VARCHAR. Your application can process the result set of such queries just like it normally does for relational queries. Such extraction queries also exploit the parsed storage format of DB2 pureXML; that is, they extract XML values without XML parsing.
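The code page caveat above can be illustrated with a small Python sketch. It makes no DB2 calls and is an assumption-laden illustration only: the sample document is hypothetical, Latin-1 stands in for the application code page, and how DB2 itself treats the XML declaration during conversion is not modeled. The point is simply that bytes kept in UTF-8 still match a UTF-8 declaration, while text that has been converted to another code page no longer does.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with one non-ASCII character, stored as UTF-8.
DOC_TEXT = '<?xml version="1.0" encoding="UTF-8"?><name>J\u00fcrgen</name>'
utf8_bytes = DOC_TEXT.encode("utf-8")

# Binary variable: the bytes arrive unchanged, so the declared
# encoding and the actual encoding still agree.
name_from_binary = ET.fromstring(utf8_bytes).text

# Character variable: the text is converted to the application code
# page. Latin-1 is an assumed stand-in for that code page.
converted_bytes = DOC_TEXT.encode("latin-1")


def parses_as_declared(data: bytes) -> bool:
    """Return True if the bytes are well formed under their own declaration."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(name_from_binary)
    print(parses_as_declared(utf8_bytes))
    print(parses_as_declared(converted_bytes))
```

If converted text is later handed to an XML parser or written to a file, the application has to account for the new encoding itself, which is one reason binary or XML-typed variables are often the simpler choice.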
XML parsing is a central topic in XML application development. The general recommendation is to avoid XML parsing in your application as much as possible. Many common XML processing tasks can be done more efficiently and more simply with DB2’s pureXML capabilities than with custom application code.

There are tools from various vendors to help you develop XML applications. Depending on the tool that you choose you might find capabilities such as visual XML Schema design, validation of documents against schemas, generation of XML Schemas from sample data and vice versa, design and testing of XSLT style sheets, building and debugging of XQuery expressions, and many other useful features. Some of the available tools offer DB2-specific support, such as direct access to DB2 tables or DB2’s XML Schema Repository.
CHAPTER 22
Exploring XML Information in the DB2 Catalog

In this chapter we summarize the DB2 catalog information related to managing XML data in DB2. This discussion is split into two sections:
• XML information in the catalog views in DB2 for Linux, UNIX, and Windows (section 22.1)
• XML information in the catalog tables in DB2 for z/OS (section 22.2)

22.1 XML-RELATED CATALOG INFORMATION IN DB2 FOR LINUX, UNIX, AND WINDOWS

With the introduction of pureXML in DB2, several existing catalog tables have been augmented with XML-related information and several catalog tables have been added. All of these tables are discussed in the following sections.

22.1.1 Catalog Information for XML Columns

When a table with an XML column is created, this event is recorded in the catalog views SYSCAT.TABLES and SYSCAT.COLUMNS and their underlying tables. The view SYSCAT.COLUMNS contains one entry for each column in a table. XML columns are shown as type XML (see Figure 22.1).

Base table inlining for XML columns was introduced in DB2 9.5 for Linux, UNIX, and Windows. Hence, for an XML column in DB2 9.5 and higher, the column INLINE_LENGTH in the view SYSCAT.COLUMNS shows the inline length of that XML column. An inline length of 0 means that the column is not inlined. In DB2 9.7, the column PCTINLINED was added. If statistics have been collected for the table, this column shows the percentage of XML documents that
are inlined. With an inline length of 500 bytes, 63% of the documents in the XML column INFO are inlined in the example in Figure 22.1. Inlined XML storage is discussed in section 3.4, Using XML Base Table Row Storage (Inlining).

    SELECT SUBSTR(tabname,1,10)  AS tabname,
           SUBSTR(colname,1,10)  AS colname,
           SUBSTR(typename,1,10) AS type,
           inline_length, pctinlined
    FROM syscat.columns
    WHERE tabname = 'CUSTOMER';

    TABNAME    COLNAME    TYPE       INLINE_LENGTH PCTINLINED
    ---------- ---------- ---------- ------------- ----------
    CUSTOMER   CID        BIGINT                 0         -1
    CUSTOMER   INFO       XML                  500         63
    CUSTOMER   HISTORY    XML                    0          0

Figure 22.1 Querying the catalog view SYSCAT.COLUMNS

22.1.2 The XML Strings and Paths Tables

Two system tables contain information about the XML tags and paths that occur in the XML data that is stored in the database. Both tables contain DB2’s internal information and are not meant to be queried by user applications or database administrators.

The system table SYSIBM.SYSXMLSTRINGS was discussed in section 3.2, Understanding pureXML Storage. It contains the database-wide mapping from XML tag names to the stringIDs that are used in DB2’s internal XML representation. The table consists of three columns, shown in Figure 22.2.

    DESCRIBE TABLE sysibm.sysxmlstrings;

                     Data type             Column
    Column name      schema     Data type  Length  Scale  Nulls
    ---------------- ---------- ---------- ------- ------ ------
    STRINGID         SYSIBM     INTEGER          4      0 No
    STRING           SYSIBM     VARCHAR       1001      0 No
    IS_TEMPORARY     SYSIBM     CHARACTER        1      0 No

      3 record(s) selected.

Figure 22.2 Description of the catalog table SYSIBM.SYSXMLSTRINGS

Figure 22.3 shows how you can query the catalog table SYSIBM.SYSXMLSTRINGS. Since the STRING column contains the tag names in hexadecimal form, you need to convert them to character strings, using the function SYSIBM.XMLBIT2CHAR.
    SELECT stringid,
           SUBSTR(SYSIBM.XMLBIT2CHAR(string),1,50),
           is_temporary
    FROM sysibm.sysxmlstrings;

Figure 22.3 Querying the table SYSIBM.SYSXMLSTRINGS

The second system table is called SYSIBM.SYSXMLPATHS. It maps paths to pathIDs, much like the table SYSXMLSTRINGS maps tags to stringIDs. The table contains three columns, as shown in Figure 22.4. The paths are stored in a binary format.

                   Data type             Column
    Column name    schema     Data type  Length  Scale  Nulls
    -------------- ---------- ---------- ------- ------ ------
    PATHID         SYSIBM     INTEGER          4      0 No
    PATHTYPE       SYSIBM     CHARACTER        1      0 No
    PATH           SYSIBM     VARCHAR       1000      0 No

Figure 22.4 Columns in the catalog table SYSIBM.SYSXMLPATHS

22.1.3 The Internal XML Regions and Path Indexes

When you create a table with one or more XML columns, one XML path index is automatically created for each XML column. DB2 also creates a single XML regions index for all XML columns in a table (see section 3.3, XML Storage in DB2 for Linux, UNIX, and Windows). These internal indexes associated with XML columns are distinct from XML indexes defined by you.

Table 22.1 shows the possible values for the column INDEXTYPE in the catalog view SYSCAT.INDEXES where all indexes are recorded. The XML path index has an INDEXTYPE of XPTH, and the XML regions index is indicated by XRGN. They are internal indexes for use by DB2 and do not appear in query execution plans. The index types XVIL and XVIP stand for the logical and physical representation of a user-defined XML index (discussed in the next section).

Table 22.1 Column INDEXTYPE in the catalog view SYSCAT.INDEXES

    Column Name  Data Type  Description
    INDEXTYPE    CHAR(4)    Type of index:
                            BLOK = Block index
                            CLUS = Clustering index (controls the physical
                                   placement of newly inserted rows)
                            DIM  = Dimension block index
                            REG  = Regular index
                            XPTH = XML path index
                            XRGN = XML regions index
                            XVIL = Index over XML column (logical)
                            XVIP = Index over XML column (physical)
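When an application or script post-processes rows from SYSCAT.INDEXES, the INDEXTYPE codes from Table 22.1 can be grouped as in the following Python sketch. The grouping mirrors the text above (XPTH and XRGN are internal, XVIL and XVIP belong to user-defined XML indexes); the function name classify_index and the three category labels are hypothetical choices for illustration, not DB2 terminology.

```python
# INDEXTYPE codes and descriptions as listed in Table 22.1.
INDEXTYPE_DESCRIPTIONS = {
    "BLOK": "Block index",
    "CLUS": "Clustering index",
    "DIM": "Dimension block index",
    "REG": "Regular index",
    "XPTH": "XML path index",
    "XRGN": "XML regions index",
    "XVIL": "Index over XML column (logical)",
    "XVIP": "Index over XML column (physical)",
}

INTERNAL_XML = {"XPTH", "XRGN"}      # created and used by DB2 itself
USER_DEFINED_XML = {"XVIL", "XVIP"}  # two entries per user-defined XML index


def classify_index(indextype: str) -> str:
    """Map an INDEXTYPE value to a coarse category (labels are illustrative)."""
    # INDEXTYPE is CHAR(4), so short codes such as 'REG' arrive blank-padded.
    code = indextype.strip().upper()
    if code not in INDEXTYPE_DESCRIPTIONS:
        raise ValueError(f"unknown INDEXTYPE: {indextype!r}")
    if code in INTERNAL_XML:
        return "internal XML index"
    if code in USER_DEFINED_XML:
        return "user-defined XML index"
    return "relational index"


if __name__ == "__main__":
    for code in ("XRGN", "XPTH", "REG ", "XVIL", "XVIP"):
        print(code.strip(), "->", classify_index(code))
```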
22.1.4 Catalog Information for User-Defined XML Indexes

In the DB2 catalog you find that a user-defined XML index is internally represented by a logical index and a physical index. The corresponding index types in SYSCAT.INDEXES are XVIL and XVIP, respectively (refer to Table 22.1). The logical index contains just the index definition, while the physical index contains the actual B-tree structure. The purpose of this separation is to leave room and flexibility for more advanced indexing implementations in the future. For example, it is conceivable that a single logical index can be represented by multiple physical B-trees. But, up to DB2 9.7, there is a one-to-one relationship between logical and physical indexes.

For each index that you define there are two entries in SYSCAT.INDEXES. The name of the logical index is the name you specified in the CREATE INDEX statement. It is also the name that appears in an access plan. The physical index name is system generated and cannot be influenced. You can list the indexes for the customer table as shown in Figure 22.5.

    SELECT SUBSTR(indschema,1,10) AS indschema,
           SUBSTR(indname,1,20)   AS indname,
           SUBSTR(tabschema,1,10) AS tabschema,
           SUBSTR(tabname,1,10)   AS tabname,
           indextype              AS type
    FROM syscat.indexes
    WHERE tabname = 'CUSTOMER';

    INDSCHEMA  INDNAME              TABSCHEMA  TABNAME    TYPE
    ---------- -------------------- ---------- ---------- -----
    SYSIBM     SQL080729100207180   DB2ADMIN   CUSTOMER   XRGN
    SYSIBM     SQL080729100207390   DB2ADMIN   CUSTOMER   XPTH
    SYSIBM     SQL080729100207420   DB2ADMIN   CUSTOMER   XPTH
    DB2ADMIN   PK_CUSTOMER          DB2ADMIN   CUSTOMER   REG
    DB2ADMIN   CUST_CID_XMLIDX      DB2ADMIN   CUSTOMER   XVIL
    SYSIBM     SQL080729100209890   DB2ADMIN   CUSTOMER   XVIP

      6 record(s) selected.

Figure 22.5 Listing the indexes for a table

The first row in the result set in Figure 22.5 shows the XML regions index of the customer table (XRGN). The next two rows show the two path indexes of the customer table. 
There are two path indexes because the customer table in the DB2 sample database has two XML columns, info and history (see Figure 22.1). Further, the query result shows a regular relational index (REG) and a user-defined XML index called CUST_CID_XMLIDX, which is the logical XML index name. The index with the name SQL080729100209890 is the corresponding physical index, which is not obvious unless you consult the new catalog view SYSCAT.INDEXXMLPATTERNS. This catalog view contains the relationships between physical and logical XML indexes. Figure 22.6 shows how to find the XML physical index for a given logical index name.
    SELECT SUBSTR(indname,1,20)  AS indname,
           SUBSTR(pindname,1,20) AS pindname
    FROM syscat.indexxmlpatterns
    WHERE indname = 'CUST_CID_XMLIDX';

    INDNAME              PINDNAME
    -------------------- --------------------
    CUST_CID_XMLIDX      SQL080729100209890

Figure 22.6 Listing the physical index name for a given logical index name

Table 22.2 summarizes the columns of the catalog view SYSCAT.INDEXXMLPATTERNS.

Table 22.2 SYSCAT.INDEXXMLPATTERNS

    Column Name  Data Type     Nullable  Description
    INDSCHEMA    VARCHAR(128)            Relational schema name of the logical index
    INDNAME      VARCHAR(128)            Unqualified name of the logical index
    PINDNAME     VARCHAR(128)            Unqualified name of the physical index
    PINDID       SMALLINT                Identifier for the physical index
    TYPEMODEL    CHAR(1)                 Q = SQL DATA TYPE (Ignore invalid values)
                                         R = SQL DATA TYPE (Reject invalid values)
    DATATYPE     VARCHAR(128)            Name of the data type
    HASHED       CHAR(1)                 Indicates whether the value is hashed
                                         N = Not hashed
                                         Y = Hashed
    LENGTH       SMALLINT                VARCHAR(n) length; 0 otherwise
    PATTERNID    SMALLINT                Identifier for the pattern
    PATTERN      CLOB(2M)      Y         XMLPATTERN used in the index definition

The column PATTERN contains the XPath expression that was used in the XMLPATTERN clause of the XML index definition. The query in Figure 22.7 reveals the XMLPATTERN and data type that were used in the CREATE INDEX command of the index CUST_PHONES_XMLIDX.
    SELECT SUBSTR(indname,1,20)  AS indname,
           SUBSTR(pattern,1,30)  AS pattern,
           SUBSTR(datatype,1,10) AS datatype,
           length
    FROM syscat.indexxmlpatterns
    WHERE indname = 'CUST_PHONES_XMLIDX';

    INDNAME             PATTERN              DATATYPE   LENGTH
    ------------------- -------------------- ---------- ------
    CUST_PHONES_XMLIDX  /customerinfo/phone  VARCHAR        25

Figure 22.7 Obtaining information about an XML index

The view SYSCAT.INDEXXMLPATTERNS does not contain a column for the table name. To list all XML index definitions for a specific table, use a join between the views SYSCAT.INDEXXMLPATTERNS and SYSCAT.INDEXES (see Figure 22.8).

    SELECT SUBSTR(p.indname,1,20)  AS indname,
           SUBSTR(p.pattern,1,80)  AS pattern,
           SUBSTR(p.datatype,1,10) AS datatype
    FROM syscat.indexxmlpatterns p, syscat.indexes i
    WHERE p.indschema = i.indschema
      AND p.indname = i.indname
      AND i.tabname = 'CUSTOMER';

    INDNAME              PATTERN                     DATATYPE
    -------------------- --------------------------  ----------
    CUST_CID_XMLIDX      /customerinfo/@Cid          DOUBLE
    CUST_NAME_XMLIDX     /customerinfo/name          VARCHAR
    CUST_PHONES_XMLIDX   /customerinfo/phone         VARCHAR
    CUST_PHONET_XMLIDX   /customerinfo/phone/@type   VARCHAR

Figure 22.8 Listing all XML index definitions for a given table

When you issue the RUNSTATS command to collect statistics for an XML index you can use either the logical or the physical index name, as shown in Figure 22.9. No matter which name you use, statistics are only recorded for and associated with the physical XML index. XML index statistics are explained in Chapter 13, Defining and Using XML Indexes.

    RUNSTATS ON TABLE db2admin.customer
      FOR INDEXES db2admin.cust_cid_xmlidx;

    RUNSTATS ON TABLE db2admin.customer
      FOR INDEXES sysibm.SQL080729100209890;

Figure 22.9 Running RUNSTATS on an XML index
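The reason for the join in Figure 22.8 can be mimicked in memory with a short Python sketch. It is an illustration only and makes no database calls: the tuples imitate rows of the two catalog views, the CUSTOMER rows follow Figure 22.5 and Figure 22.8, and the ORDERS index together with the function name xml_indexes_for_table is hypothetical. Because the pattern rows carry no table name, the table has to be resolved through the index rows by matching schema and logical index name.

```python
# Rows shaped like SYSCAT.INDEXES: (indschema, indname, tabname, indextype).
# The CUSTOMER rows follow Figure 22.5; the ORDERS row is hypothetical.
indexes = [
    ("DB2ADMIN", "PK_CUSTOMER", "CUSTOMER", "REG"),
    ("DB2ADMIN", "CUST_CID_XMLIDX", "CUSTOMER", "XVIL"),
    ("DB2ADMIN", "ORD_DATE_XMLIDX", "ORDERS", "XVIL"),
]

# Rows shaped like SYSCAT.INDEXXMLPATTERNS:
# (indschema, indname, pattern, datatype). Note: no table name here.
patterns = [
    ("DB2ADMIN", "CUST_CID_XMLIDX", "/customerinfo/@Cid", "DOUBLE"),
    ("DB2ADMIN", "ORD_DATE_XMLIDX", "/order/date", "DATE"),
]


def xml_indexes_for_table(tabname: str) -> list:
    """Join the two row sets on (indschema, indname), filtered by table."""
    keys_for_table = {
        (schema, name)
        for (schema, name, table, _indextype) in indexes
        if table == tabname
    }
    return [
        (name, pattern, datatype)
        for (schema, name, pattern, datatype) in patterns
        if (schema, name) in keys_for_table
    ]


if __name__ == "__main__":
    for row in xml_indexes_for_table("CUSTOMER"):
        print(row)
```

Relational indexes such as PK_CUSTOMER drop out of the result automatically, since they have no matching pattern row.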
22.1.5 Catalog Information for XML Schemas

A set of catalog tables and views has been introduced to manage XML Schemas and Document Type Definitions (DTDs). Collectively, these tables and views are called the XML Schema Repository (XSR). The XSR also includes commands and stored procedures to add, update, and remove XML Schemas in the XSR. There is one XSR per database. The information in the XML Schema Repository is exposed to the user through seven catalog views:

• SYSCAT.XSROBJECTS
• SYSCAT.XSROBJECTCOMPONENTS
• SYSCAT.XDBMAPGRAPHS
• SYSCAT.XSROBJECTAUTH
• SYSCAT.XSROBJECTDEP
• SYSCAT.XSROBJECTHIERARCHIES
• SYSCAT.XDBMAPSHREDTREES

These catalog views, their content, and the commands to manage XML Schemas are described in Chapter 16, Managing XML Schemas.

22.2 XML-Related Catalog Information in DB2 for z/OS

Some of the XML-related catalog tables in DB2 for z/OS are similar to the corresponding catalog views in DB2 for Linux, UNIX, and Windows. For example, DB2 for z/OS has an XML Schema Repository (XSR) just like DB2 for Linux, UNIX, and Windows. XML columns, tables, and table spaces are also recorded in the DB2 for z/OS catalog tables. Remember that when a table is created with an XML column, DB2 for z/OS automatically creates an XML table space, an XML table, a node ID index, and a document ID index. All of these are listed in catalog tables.

22.2.1 Catalog Information for XML Storage Objects

The catalog table SYSIBM.SYSXMLRELS contains one row for each XML column. The entries in this table correlate XML columns to the user tables that they logically belong to and to the internal XML tables where they are physically stored. Table 22.3 explains the columns of this catalog table.
Table 22.3 SYSIBM.SYSXMLRELS

Column      Data Type              Description
----------  ---------------------  ----------------------------------------------
TBOWNER     VARCHAR(128) NOT NULL  Schema or qualifier of the base table.
TBNAME      VARCHAR(128) NOT NULL  Name of the base table.
COLNAME     VARCHAR(128) NOT NULL  Name of the XML column in the base table.
XMLTBOWNER  VARCHAR(128) NOT NULL  Schema or qualifier of the internal XML table.
XMLTBNAME   VARCHAR(128) NOT NULL  Name of the internal XML table.
XMLRELOBID  INTEGER NOT NULL       Internal identifier of the relationship between
                                   the base table and the XML table.
IBMREQD     CHAR(1) NOT NULL       The value Y indicates that the row came from the
                                   machine-readable material (MRM) tape.
CREATEDTS   TIMESTAMP NOT NULL     Time when the XML table was created.
RELCREATED  CHAR(1) NOT NULL       The release of DB2 that is used to create the
                                   object.

The table SYSIBM.SYSXMLSTRINGS acts as DB2’s internal dictionary of XML tags. Each row contains an XML tag and the corresponding unique integer ID that DB2 uses to compress XML data (see Table 22.4). The tag can be an element name, attribute name, namespace prefix, or a namespace URI. DB2’s internal use of this table is illustrated in Chapter 3, Designing and Managing XML Storage Objects.

Table 22.4 SYSIBM.SYSXMLSTRINGS

Column    Data Type                      Description
--------  -----------------------------  ------------------------------------------
STRINGID  INTEGER NOT NULL               Unique ID for the XML tag in the column
          GENERATED ALWAYS AS IDENTITY   STRING.
STRING    VARCHAR(1000) NOT NULL         The XML tag.
IBMREQD   CHAR(1) NOT NULL               A value of Y indicates that the row came
                                         from the basic machine-readable material
                                         (MRM) tape.

There are also existing catalog tables that have been augmented with information about XML objects. For example, the catalog table SYSIBM.SYSTABLES contains one row for every table in the database. Its column TYPE has the value P if the table is an internal XML table (see Table 22.5). This value allows you to distinguish explicitly created user tables from implicitly created XML tables.
Table 22.5 Column TYPE in the catalog table SYSIBM.SYSTABLES

Column Name  Data Type         Description
-----------  ----------------  ------------------------------------------
TYPE         CHAR(1) NOT NULL  Type of object:
                               A = Alias
                               C = Clone table
                               G = Created global temporary table
                               M = Materialized query table
                               P = Implicit table created for XML columns
                               T = Table
                               V = View
                               X = Auxiliary table

You can list tables and internal XML tables using the query in Figure 22.10. The output shows the base table (type T) and the internal XML table (type P). Note that internal XML table names always start with an X.

    SELECT SUBSTR(creator,1,10) AS creator,
           SUBSTR(name,1,30) AS name, type
    FROM sysibm.systables
    WHERE name like '%CUST%' #

    ---------+---------+---------+---------
    CREATOR    NAME                 TYPE
    ---------+---------+---------+---------
    USER011    CUSTOMER             T
    USER011    XCUSTOMER            P

Figure 22.10 Listing XML tables in DB2 for z/OS

When you create a table with an XML column in DB2 for z/OS, DB2 automatically assigns names to the internal tables and table spaces that physically store the XML columns. You cannot influence their names. DB2 assigns a table space name of the form xxxxyyyy, where xxxx is the first four characters of the internal XML table name (XCUS for the internal table XCUSTOMER in Figure 22.11), and yyyy is a number that guarantees uniqueness. The example in Figure 22.11 creates a table called customer, which contains a relational column and three XML columns. The query in Figure 22.11 shows the table and table space names for the customer table and the three internal XML tables that are created, one for each XML column in the customer table.
    CREATE TABLE customer (id INT, info XML, info2 XML, info3 XML);

    SELECT SUBSTR(name,1,15) AS tabname,
           SUBSTR(tsname,1,15) AS tsname, type
    FROM sysibm.systables
    WHERE name LIKE '%CUST%' #

    ---------+---------+---------+---------+----
    TABNAME          TSNAME           TYPE
    ---------+---------+---------+---------+----
    CUSTOMER         CUSTOMER         T
    XCUSTOMER        XCUS0000         P
    XCUSTOMER000     XCUS0001         P
    XCUSTOMER001     XCUS0002         P
    DSNE610I NUMBER OF ROWS DISPLAYED IS 4

Figure 22.11 Table spaces for a table with three XML columns in DB2 for z/OS

Another useful way to retrieve information about a table and its related XML table spaces is shown in Figure 22.12.

    SELECT SUBSTR(X.xmltbowner, 1, 15) AS owner,
           SUBSTR(X.xmltbname, 1, 20) AS name,
           T.type, T.dbname, T.tsname
    FROM SYSIBM.SYSTABLES T, SYSIBM.SYSXMLRELS X
    WHERE X.tbname = 'CUSTOMER'
      AND T.name = X.xmltbname
      AND T.creator = X.xmltbowner ;

Figure 22.12 Listing XML table spaces in DB2 for z/OS

The catalog table SYSIBM.SYSTABLESPACE contains one row for every table space in the database. A row that represents an internal XML table space has the value P in the column TYPE, and the value X in the column LOCKRULE (see Table 22.6).

Table 22.6 Columns TYPE and LOCKRULE in the table SYSIBM.SYSTABLESPACE

Column Name  Data Type         Description
-----------  ----------------  ------------------------------------------------
TYPE         CHAR(1) NOT NULL  P = Implicit table space created for XML columns.
             WITH DEFAULT      (…)
LOCKRULE     CHAR(1) NOT NULL  Lock size of the table space:
                               A = Any
                               L = Large object (LOB)
                               P = Page
                               R = Row
                               S = Table space
                               T = Table
                               X = Implicitly created XML table space
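The three catalog tables discussed so far can also be combined in one query. The sketch below is an illustration under stated assumptions rather than a query from the DB2 documentation: it extends Figure 22.12 with the XML column name from SYSIBM.SYSXMLRELS and with the columns TYPE and LOCKRULE from Table 22.6. It assumes that SYSIBM.SYSTABLESPACE identifies a table space by its columns DBNAME and NAME, and it reuses the owner USER011 from Figure 22.10 as an example value.

```sql
-- Sketch only: one row per XML column of the base table CUSTOMER, showing
-- where the column is physically stored and how its table space is flagged.
SELECT SUBSTR(x.colname, 1, 10)   AS xmlcol,
       SUBSTR(x.xmltbname, 1, 15) AS xmltable,
       t.dbname,
       t.tsname,
       s.type      AS ts_type,    -- P for an implicit XML table space
       s.lockrule                 -- X for an implicit XML table space
FROM   sysibm.sysxmlrels x
JOIN   sysibm.systables t
  ON   t.creator = x.xmltbowner
 AND   t.name    = x.xmltbname
JOIN   sysibm.systablespace s     -- assumed join keys: DBNAME and NAME
  ON   s.dbname  = t.dbname
 AND   s.name    = t.tsname
WHERE  x.tbowner = 'USER011'      -- example owner, taken from Figure 22.10
  AND  x.tbname  = 'CUSTOMER';
```

Filtering on TBOWNER as well as TBNAME avoids picking up tables named CUSTOMER that belong to other schemas.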
22.2.2 Catalog Information for XML Indexes

The catalog table SYSIBM.SYSINDEXES contains one row for every index. The value V in the column IX_EXTENSION_TYPE indicates a user-defined XML index, while the value N identifies DB2’s internal node ID index (see Table 22.7).

Table 22.7 Column IX_EXTENSION_TYPE in the table SYSIBM.SYSINDEXES

Column Name        Data Type         Description
-----------------  ----------------  -------------------------------------
IX_EXTENSION_TYPE  CHAR(1) NOT NULL  Identifies the type of extended index:
                   WITH DEFAULT      N = Node ID index
                                     S = Index on a scalar expression
                                     T = Spatial index
                                     V = XML index
                                     blank = Simple index

You can list all user-defined XML indexes with the query in Figure 22.13.

    SELECT ixcreator, ixname, tbname
    FROM sysibm.sysindexes
    WHERE ix_extension_type = 'V'

Figure 22.13 Listing XML indexes in DB2 for z/OS

If you want to check which XML elements and attributes are indexed, you need to examine the XPath expressions that were used in the XMLPATTERN clause of your XML index definitions. The XMLPATTERN of an XML index is listed in the column DERIVED_FROM of the catalog table SYSIBM.SYSKEYTARGETS. In the same table, the column CARDF contains the number of distinct documents that are indexed (see Table 22.8).

Table 22.8 Columns DERIVED_FROM and CARDF in the table SYSIBM.SYSKEYTARGETS

Column Name   Data Type               Description
------------  ----------------------  ------------------------------------------
DERIVED_FROM  VARCHAR(4000) NOT NULL  For an XML index, this column contains the
                                      XML pattern that generates the key values.
                                      For an index on a scalar expression, this
                                      column contains the text of the scalar
                                      expression that generates the keys. For any
                                      other indexes, this column is empty.
CARDF         FLOAT NOT NULL          The number of distinct values for the key.
                                      For a user-defined XML index, this value is
                                      collected for the second key target (the
                                      DOCID). For all other key targets of an XML
                                      index, the value is -2. The value is also -2
                                      if the index is an internal XML node ID
                                      index.
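To see how the two columns in Table 22.8 relate to each other, you can list all key targets of a single XML index in one result. The following is a minimal sketch that assumes an index named CUST_IDX2, which is the index created in Figure 22.14; the columns IXNAME and KEYSEQ are the ones used in that figure. No output is shown because the values depend on your data and on whether statistics have been collected.

```sql
-- Sketch only: all key targets of one XML index, ordered by key sequence.
-- Per Table 22.8, DERIVED_FROM holds the XML pattern, and CARDF is
-- collected for the DOCID key target and is -2 for the other key targets.
SELECT SUBSTR(ixname, 1, 10)       AS ixname,
       keyseq,
       SUBSTR(derived_from, 1, 22) AS xmlpattern,
       cardf
FROM   sysibm.syskeytargets
WHERE  ixname = 'CUST_IDX2'        -- index name assumed from Figure 22.14
ORDER BY keyseq;
```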
The example in Figure 22.14 illustrates how the XMLPATTERN used in a CREATE INDEX statement can subsequently be retrieved from the catalog table SYSIBM.SYSKEYTARGETS. Note that the table SYSKEYTARGETS can contain multiple rows for a given index, and only the row with KEYSEQ=1 contains the XMLPATTERN. Similarly, the cardinality of the DOCIDs is listed in the column CARDF where KEYSEQ=2.

    CREATE INDEX cust_idx2 ON customer(info)
      GENERATE KEYS USING XMLPATTERN '/customerinfo/phone'
      AS SQL VARCHAR(50) #

    SELECT SUBSTR(IXNAME,1,10) AS ixname,
           SUBSTR(DERIVED_FROM,1,22) AS xmlpattern,
           SUBSTR(TYPENAME,1,10) AS typename, LENGTH
    FROM SYSIBM.SYSKEYTARGETS
    WHERE IXNAME = 'CUST_IDX2' AND KEYSEQ=1 #

    ---------+---------+---------+---------+---------+---------+
    IXNAME      XMLPATTERN            TYPENAME    LENGTH
    ---------+---------+---------+---------+---------+---------+
    CUST_IDX2   /customerinfo/phone   VARCHAR         50

Figure 22.14 Obtaining the XMLPATTERN for an existing XML index

22.2.3 Catalog Information for XML Schemas

Table 22.9 provides a summary of the XML Schema Repository tables, which are new in DB2 9 for z/OS. They are explained in more detail in Chapter 16, Managing XML Schemas.

Table 22.9 XML Schema Repository Tables in DB2 for z/OS

Table Name                   Description
---------------------------  ----------------------------------------------
SYSIBM.XSROBJECTS            Contains one row for each registered XML
                             Schema. Rows in this table can only be changed
                             using the DB2-supplied XSR stored procedures
                             and commands.
SYSIBM.XSROBJECTGRAMMAR      An auxiliary table for the BLOB column GRAMMAR
                             in SYSIBM.XSROBJECTS. This table is in LOB
                             table space SYSXSRA1.
SYSIBM.XSROBJECTPROPERTY     An auxiliary table for the BLOB column
                             PROPERTIES in SYSIBM.XSROBJECTS. This table is
                             in LOB table space SYSXSRA2.
SYSIBM.XSROBJECTCOMPONENTS   Contains one row for each component (schema
                             document) of an XML Schema. Rows in this table
                             can only be changed using the DB2-supplied XSR
                             stored procedures and commands.
SYSIBM.XSRCOMPONENT          An auxiliary table for the BLOB column
                             COMPONENT in SYSIBM.XSROBJECTCOMPONENTS. This
                             table is in LOB table space SYSXSRA3.
SYSIBM.XSRPROPERTY           An auxiliary table for the BLOB column
                             PROPERTIES in SYSIBM.XSROBJECTCOMPONENTS. This
                             table is in LOB table space SYSXSRA4.
SYSIBM.XSROBJECTHIERARCHIES  Contains one row for each component (document)
                             of an XML Schema to record the XML Schema
                             document hierarchy.

22.3 Summary

With the introduction of pureXML in DB2, several new types of database objects can exist in a database. Depending on the platform, they can include XML columns, XML tables, XML table spaces, XML Schemas, as well as user-defined and system-defined XML indexes. New catalog tables have been introduced and existing catalog tables extended to store appropriate metadata about these new objects. As a result, users can run catalog queries to learn about the existing XML objects much like they normally do for relational objects.
Chapter 23 Test Your Knowledge: The DB2 pureXML Quiz

This chapter contains multiple choice questions based on the content of the book. You can use these questions to test your knowledge and revisit specific topic areas. Each question comes with four or five possible answers, (a) through (e). There is exactly one correct answer, unless otherwise stated. The solutions are at the end of the chapter. Many questions apply to both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Questions that are platform-specific mention the applicable platform explicitly.

There is an official IBM pureXML Technical Mastery Test that you can take to certify your DB2 pureXML expertise. Further information on that test is available at http://www.ibm.com/certify/mastery_tests/objM34.shtml. The questions in this chapter are different from the questions in the mastery test, but cover many of the same topics.

23.1 Designing XML Data and Applications

1. Which of the elements shown is well-formed?
   (a) <name title=Mr></name>
   (b) <name title="Mr"><name>
   (c) <name title="Mr"><name/>
   (d) <name title=Mr><name/>
   (e) <name title="Mr"></name>
2. When choosing the size and granularity of XML documents that you want to store in DB2, which guideline is correct?
   (a) Choose the XML document granularity with respect to the logical business objects and the anticipated predominant granularity of access.
   (b) Try to make the stored XML documents as large as possible.
   (c) Choose the XML document granularity depending on the page size of the table.
   (d) Try to keep all XML documents between 100KB and 1MB since smaller or larger documents tend to yield lower performance.
   (e) Try to make the stored XML documents as small as possible.

3. When should you use elements or attributes in your XML documents?
   (a) Always use XML elements because attributes are supported in XML only for backward compatibility.
   (b) Attributes are often better for non-Unicode values.
   (c) There are no clear rules, but elements are more flexible. They can be repeated and nested.
   (d) Attributes are for numeric data only.
   (e) Answers (b) and (c).

4. You want to encode information about a product. The product is a blue jacket of size 42. Which of the following three XML formats is preferable?

   Format 1
       <product>
         <Jacket>
           <size>42</size>
           <color>blue</color>
         </Jacket>
       </product>

   Format 2
       <product>
         <type>Jacket</type>
         <size>42</size>
         <color>blue</color>
       </product>

   Format 3
       <product>
         <field name="type" value="Jacket">
         <field name="size" value="42">
         <field name="color" value="blue">
       </product>

   (a) All three options are equally good.
   (b) Format 1.
   (c) Format 2.
   (d) Format 3.
   (e) Format 2 and Format 3 are the best options, and equally preferable.
5. What is the maximum XML document size that you can insert into or read from a DB2 table?
   (a) Depends on the page size.
   (b) 4KB.
   (c) 2GB.
   (d) 64KB.
   (e) There is no upper size limit.

23.2 Designing and Managing Storage Objects for XML

6. What is the value of the title element in the following XML document?

       <title>The <bold>DB2 pureXML</bold> Cookbook</title>

   (a) The <bold>DB2 pureXML</bold> Cookbook
   (b) <title>The <bold>DB2 pureXML</bold> Cookbook</title>
   (c) The DB2 pureXML Cookbook
   (d) TheDB2pureXMLCookbook
   (e) The title element has no value as such; it is a construct.

7. How do you create an XML column called info in a DB2 table?
   (a) CREATE TABLE cust(info XML)
   (b) CREATE TABLE cust(info XML_TYPE)
   (c) CREATE TABLE cust(info XML AS CLOB)
   (d) CREATE TABLE cust(info XML(n)) where n is the maximum length of the documents.
   (e) You cannot create an XML column directly. You need to create a table with a primary key column first (column name DOCID, data type BIGINT) and then use the ALTER TABLE command to add an XML column.

8. In DB2 for Linux, UNIX, and Windows, can you store XML columns in a separate table space?
   (a) No. XML columns are stored in the same table space as the rest of the table.
   (b) Yes. The CREATE TABLE statement now has a clause XML IN <tablespace>.
   (c) Yes. If the CREATE TABLE statement has the clause LONG IN <tablespace>, then XML is stored in the long table space.
   (d) Yes. A new XML clause in the CREATE DATABASE statement allows for this.
   (e) Yes, if the CREATE TABLESPACE statement has the clause ALLOW XML DATA.
9. In DB2 for z/OS, when a base table is created with an XML column, an additional internal XML table is created automatically. How many columns does this internal table have?
   (a) 1
   (b) 3
   (c) 5
   (d) As many columns as there are XML columns in the base table
   (e) As many columns as there are total columns in the base table

10. In DB2 for z/OS, which page size is used for the internal XML table?
   (a) The same page size as the base table space
   (b) 8KB
   (c) 32KB
   (d) 16KB
   (e) 128KB, which is a new page size in DB2 9 and used for XML data only

11. In DB2 for Linux, UNIX, and Windows, when is a Regions index created?
   (a) Every time you create an index on an XML column.
   (b) When you create more than one index on an XML column.
   (c) When you create an index consisting of more than one element.
   (d) One Regions index is created automatically for each XML column when a table is created.
   (e) One Regions index is created automatically for each table that contains one or more XML columns.

12. In DB2 for Linux, UNIX, and Windows, what does base table row storage or inlining mean?
   (a) It means that all XML data is stored in a DMS table space regardless of what was specified when the table space was created.
   (b) It means that all XML data is stored in an SMS table space regardless of what was specified when the table space was created.
   (c) It means that all data in a specified table is copied into the bufferpool when the database is activated. This table is defined as being inlined in the bufferpool.
   (d) It means that XML documents are stored next to the relational data on pages of the DAT object of the table space.
   (e) It means that relational column values are stored within an XML document in the same row.
13. Which of the following statements are correct about DB2 for z/OS?
   (a) Each internal XML table resides in a “partition by growth” (PBG) table space, whenever the base table is simple, segmented, partitioned, or partitioned by growth.
   (b) An internal XML table is clustered by DOCID and MIN_NODEID.
   (c) The DocID index and the NodeID indexes are always created as non-partitioned indexes.
   (d) An internal XML table space inherits the COMPRESS YES parameter from the base table space.
   (e) All of the above.

14. In DB2 for Linux, UNIX, and Windows, how do you reorganize XML data?
   (a) You use the REORG utility in either “online” or “offline” mode.
   (b) You can only use the REORG utility in “offline” mode and have to specify the keyword LONGLOBDATA in the REORG command.
   (c) You can use the REORG utility in “online” mode but have to specify the keyword INCLUDEXML in the REORG command.
   (d) You do not need to reorganize the tables as the data is not relational and therefore does not require reorganizing.
   (e) There is a new utility called REORG_XML, which has been specifically introduced to deal with XML data (as the structure of XML data is different from relational data).

15. In DB2 9.7 for Linux, UNIX, and Windows, are XML columns allowed in range partitioned tables and MDC tables?
   (a) Yes in both
   (b) Only in range partitioned tables
   (c) Only in MDC tables
   (d) Yes, but only in a partitioned database (DPF)
   (e) No; not allowed in either

16. In DB2 for z/OS, the column DB2_GENERATED_DOCID_FOR_XML is automatically created in each table with an XML column. How can you query this column?
   (a) You can select from it just like any other column in the table.
   (b) It is an internal column and you cannot select from it.
   (c) You cannot use SELECT * to view the contents of the column, but have to name the column explicitly in the SELECT statement.
   (d) You cannot select from the column, but have to use the UNLOAD utility to view the contents.
   (e) You can view the contents using the CHECKDATA utility.

17. In DB2 for z/OS, which are the new ZPARM parameters that allow you to limit the amount of DB2 memory used for XML processing? Select two answers.
   (a) MEMXMLA
   (b) MEMXMLS
   (c) XMLMEM
   (d) XMLVALA
   (e) XMLVALS

23.3 Inserting and Retrieving XML Data

18. According to the XML standard, which of the following is NOT considered a whitespace character?
   (a) space
   (b) carriage return
   (c) line feed
   (d) tab
   (e) backspace

19. Which of the following characters may not appear in an XML element value, unless they are properly escaped? There are two correct answers.
   (a) ampersand (&)
   (b) question mark (?)
   (c) the “at” sign (@)
   (d) pound/hash sign (#)
   (e) less-than symbol (<)

20. Which of the following statements about INSERT statements is NOT correct?
   (a) You can insert XML documents into an XML column using parameter markers.
   (b) You can insert XML documents into an XML column using host variables.
   (c) You can insert XML documents only if their encoding is UTF-8.
   (d) You can insert XML documents from a CLOB column into an XML column.
   (e) You can insert XML documents and choose whether to strip or preserve whitespace.

21. When you insert an XML document with an XML declaration, can there be whitespace preceding the XML declaration in the document (as shown in the following)?

       INSERT INTO myshelf VALUES(10,' <?xml version="1.0"?>…

   (a) Yes, the spaces are ignored.
   (b) No, the insert will fail.
   (c) Yes, but only if you do not validate the document with an XML Schema.
   (d) Yes, if the spaces are declared as a namespace.
   (e) Yes, provided the DB2 registry variable CURRENT IMPLICIT XMLPARSE OPTION is set to STRIP WHITESPACE.

23.4 Moving XML Data

22. Which of the following statements are correct about the LOAD utility in DB2 for z/OS? There are two correct answers.
   (a) The LOAD utility treats XML columns as variable-length data when loading XML directly from input records, and expects a two-byte length field preceding the actual XML value.
   (b) By default, the LOAD utility preserves whitespace when loading XML data.
   (c) XML documents that don’t fit into 32KB input records can be loaded from separate files.
   (d) When loading XML data, you need to specify the name of the internal XML table in the load job.
   (e) The LOAD utility does not check whether the input documents are well-formed.

23. Which function can split a large XML document into smaller XML documents?
   (a) XMLSPLIT
   (b) XMLSERIALIZE
   (c) XMLSHRED
   (d) XMLTABLE
   (e) SUBSTR
24. The Export utility in DB2 for Linux, UNIX, and Windows supports which of the following options for exporting XML data? There are two correct answers.
   (a) Relational and XML columns of a table can be exported side by side into a single data file.
   (b) Exported XML documents can be written to individual files, with one file per XML document.
   (c) Exported XML documents can be concatenated to produce several large files, with one file per output directory.
   (d) If an XML column contains documents for multiple XML Schemas, the Export utility can produce one data file per schema, with each file containing the documents for a given schema.
   (e) If an XML column contains documents that have been validated, the XML Schemas for the exported documents can optionally be written to a separate file.

23.5 Querying XML

25. Which of the following statements is true?
   (a) DB2 allows XPath expressions to be embedded in SQL, based on the SQL/XML standard.
   (b) DB2 allows XPath expressions to be embedded in SQL, using DB2 proprietary functions.
   (c) DB2 allows the use of XPath expressions only for XML documents that have been successfully validated with an XML Schema.
   (d) DB2 allows the use of XPath in SQL SELECT statements, but not in INSERT, UPDATE, or DELETE statements.
   (e) Both (a) and (d) are correct.

26. Why is the following type of query typically not useful?

       SELECT XMLQUERY('$p/customerinfo/addr[pcode-zip = "95141"]'
                       PASSING info AS "p")
       FROM customer

   (a) It can never use an index to evaluate the predicate on pcode-zip.
   (b) It returns as many result rows as there are rows in table customer.
   (c) It returns empty rows for those documents where pcode-zip is not 95141.
   (d) There is no predicate in a WHERE clause to allow filtering of rows.
   (e) All of the above.
27. What does the following query return?

       SELECT description
       FROM product
       WHERE XMLEXISTS('$p/product/id = 178' PASSING description AS "p")

   (a) Nothing. Zero rows are returned because the syntax of the predicate is invalid.
   (b) All documents where the product id element in the XML column description has the value 178.
   (c) All documents described in (b), as well as empty rows for those XML documents where product id is not 178.
   (d) All documents from table product.
   (e) NULL.

28. Consider the following XML document:

       <customerinfo Cid="1099">
         <name>Matt Foreman</name>
         <addr type="Work">
           <street>12 Short Lane</street>
           <city>Toronto</city>
         </addr>
         <addr type="Home">
           <street>1596 Baseline</street>
           <city>Toronto</city>
         </addr>
       </customerinfo>

   Which of the following XPath expressions returns <name>Matt Foreman</name>?

       (1) /customerinfo[addr/city="Toronto"]/name
       (2) /customerinfo[.//city="Toronto"]/name
       (3) //addr[city="Toronto"]/../customerinfo/name
       (4) /customerinfo/addr[/city="Toronto"]/../name
       (5) //name[..//city="Toronto"]

   (a) All of the above
   (b) All of the above, except (3)
   (c) (1), (2), and (4)
   (d) (1), (2), and (5)
   (e) (3) and (4)
29. Assume the two documents shown below are stored in the XML column doc of table tab. Which of the documents are returned by the query below?

   Document A
       <Test> <A> </A> <A> </A> </Test> <B>5</B> <C> </C> <F>6</F>

   Document B
       <Test> <A> </A> </Test> <B>5</B> <C> <F>6</F> </C>

       SELECT doc
       FROM tab
       WHERE XMLEXISTS('$i/Test/A[B=5 and C/F=6]' PASSING doc AS "i")

   (a) Document A
   (b) Document B
   (c) Both documents
   (d) Neither document

30. Which of the following are valid functions in DB2 for Linux, UNIX, and Windows? There are two correct answers.
   (a) db2-fn:xmlcolumn
   (b) db2-fn:sqlquery
   (c) db2-fn:columnxml
   (d) db2-fn:xmldocument
   (e) db2-fn:sqlxml

31. Which of the following are not XQuery keywords? There are two correct answers.
   (a) for
   (b) let
   (c) when
   (d) order by
   (e) result
32. What happens if you run the following query against an XML document where the value of the XML element pcode-zip is the string “NE1 XQ7”?

       SELECT info
       FROM customer
       WHERE XMLEXISTS('$INFO/customerinfo/addr[pcode-zip = 95141]')

   (a) No data is returned from the document. The query succeeds but returns zero rows.
   (b) The document is selected and returned.
   (c) The query is rejected at compile time due to a data type mismatch in the predicate.
   (d) The query fails at runtime when it tries to compare the string value of the pcode-zip element to the numeric value 95141.
   (e) All of the above can happen, depending on the XML Schema being used.

33. Consider the following table, queries, and results. If you run both queries, what results will you get back?

   Table tab

       mycol (XML)
       -----------------------------------
       <a> <b> 1 </b> <b> 2 </b> </a>
       <a> <b> 3 </b> <b> 4 </b> </a>

   Query A:

       XQUERY for $i in db2-fn:xmlcolumn("TAB.MYCOL")/a/b return $i

   Query B:

       SELECT XMLQUERY('$col/a/b' PASSING tab.mycol AS "col")
       FROM tab
   Result Set 1

       <b> 1 </b>
       <b> 2 </b>
       <b> 3 </b>
       <b> 4 </b>

   Result Set 2

       <b> 1 </b><b> 2 </b>
       <b> 3 </b><b> 4 </b>

   (a) Both queries return result set 1 (four rows).
   (b) Both queries return result set 2 (two rows).
   (c) Query A returns result set 1, query B result set 2.
   (d) Query A returns result set 2, query B result set 1.
   (e) Both queries return nothing (because no namespace is specified).

23.6 Producing XML from Relational Data

34. You can use the XMLFOREST function as an abbreviation for which function?
   (a) XMLATTRIBUTES
   (b) XMLCONCAT
   (c) XMLAGG
   (d) XMLCOMMENT
   (e) XMLELEMENT

35. Which function supersedes the function XML2CLOB, which was introduced in DB2 V8 to convert constructed XML data from type XML to type CLOB?
   (a) XMLDOCUMENT
   (b) XMLCOMMENT
   (c) XMLSERIALIZE
   (d) XMLTEXT
   (e) XML2CHAR

36. Which SQL/XML function can have an optional ORDER BY clause?
   (a) XMLATTRIBUTES
   (b) XMLCONCAT
   (c) XMLAGG
   (d) XMLFOREST
   (e) XMLELEMENT
37. Assume that pid and name are relational columns in a table. Which of the following is correct usage of XQuery direct element and attribute constructors to construct an element with an attribute?
   (a) <product pid={$PID}>{$NAME}</product>
   (b) <product pid="{$pid}">{$name}</product>
   (c) <product pid="$pid">$name</product>
   (d) <product pid={"$PID"}>{$NAME}</product>
   (e) <product pid="{$PID}">{$NAME}</product>

23.7 Converting XML to Relational Data

38. Which statement about shredding is NOT true?
   (a) You can perform transformations of the data values before insert into relational columns.
   (b) You can shred the same element or attribute value into at most one column.
   (c) You can shred multiple different elements or attributes into the same column of a table.
   (d) You can specify conditions that govern when certain elements are or are not shredded.
   (e) You can validate XML documents with an XML Schema during shredding.

39. Which SQL/XML function(s) can you use to shred XML into relational data?
   (a) DECOMP_XML
   (b) XMLSHRED
   (c) XMLDECOMP
   (d) XMLTABLE
   (e) Both (b) and (d)

40. When you annotate an XML Schema for shredding, what is the purpose of the annotation db2-xdb:normalization?
   (a) It ensures that relational target tables are fully normalized before shredding.
   (b) It converts all XML values to data type xs:string before insertion into relational tables.
   (c) It specifies how to treat whitespace in the XML documents that are shredded.
   (d) It converts XML fragments to canonical XML format.
   (e) It converts all XML attributes to elements before shredding.
41. Which schema annotation do you use to define multiple mappings for the same XML element or attribute?
   (a) db2-xdb:multimap
   (b) db2-xdb:contentHandling
   (c) db2-xdb:expression
   (d) db2-xdb:rowSetMapping
   (e) db2-xdb:condition

42. Can you create a relational view over XML data?
   (a) Yes
   (b) No
   (c) Yes, provided that the XML data has been validated
   (d) Yes, but the view has to be written in XQuery
   (e) Yes, but only with the XML View Wizard in IBM Data Studio Developer

23.8 Updating and Transforming XML Documents

43. In DB2 for Linux, UNIX, and Windows, which two keywords are not valid in an XQuery transform expression? There are two correct answers.
   (a) copy
   (b) remove
   (c) return
   (d) modify
   (e) append

44. Consider this XML document:

       <alpha x="1"><beta>2</beta></alpha>

   What will this document look like when a new node is inserted with the following insert operation?
       insert attribute y {3} after $new/alpha/beta

   (a) <alpha x="1" y="3"><beta>2</beta></alpha>.
   (b) <alpha y="3" x="1"><beta>2</beta></alpha>.
   (c) <alpha x="1"><beta y="3">2</beta></alpha>.
   (d) <alpha x="1"><beta>2</beta><y>3</y></alpha>.
   (e) Both (a) and (b) are possible since the ordering of attributes does not matter.

45. What is the return type of the function XSLTRANSFORM?
   (a) CLOB(2G)
   (b) BLOB(2G)
   (c) XML
   (d) VARCHAR(32000)
   (e) INTEGER (either 0 or 1, depending on the success of the transformation)

23.9 Defining and Using XML Indexes

46. When you create an XML index in DB2 for Linux, UNIX, and Windows, which of the following SQL types is not a valid index type?
   (a) VARCHAR HASHED
   (b) DOUBLE
   (c) INTEGER
   (d) DATE
   (e) TIMESTAMP

47. In DB2 for z/OS, which data type do you use to define an XML index for numeric data?
   (a) DOUBLE
   (b) DECFLOAT
   (c) FLOAT
   (d) REAL
   (e) DECIMAL
48. An XML index is defined as '/product/name' AS VARCHAR(40). What happens when you insert a document with a product name larger than 40 bytes?
   (a) This product name will be indexed by a hash value.
   (b) The insert succeeds but there will be no index entry for this document. A warning is issued.
   (c) The insert succeeds but there will be no index entry for this document. No warning is issued.
   (d) The insert fails with an error and the document is rejected.
   (e) The insert succeeds, and the first 40 bytes of the product name are used as the index key (lookup in a VARCHAR index is by prefix anyway).

49. If you define an XML index on /product/pid as data type DOUBLE or DECFLOAT, what is the default behavior if you insert a document with a product id of PX25?
   (a) This product pid will be indexed by a hash value.
   (b) The insert succeeds but there will be no index entry for this document. A warning is issued.
   (c) The insert succeeds but there will be no index entry for this document. No warning is issued.
   (d) The insert fails with an error and the document is rejected.
   (e) It depends on whether the document was validated, and on the schema type for /product/pid.

50. Which index is eligible to evaluate the following query?

       SELECT XMLQUERY('$i/customerinfo/name' PASSING info AS "i")
       FROM customer
       WHERE XMLEXISTS('$i/customerinfo[@Cid < "1005"]' PASSING info AS "i")

       CREATE INDEX idx1 ON customer(info) GENERATE KEY USING…

   (a) XMLPATTERN '/customerinfo/@Cid' AS SQL DECFLOAT
   (b) XMLPATTERN '//@Cid' AS SQL VARCHAR(8)
   (c) XMLPATTERN '/customerinfo/@Cid' AS SQL VARCHAR HASHED
   (d) All of the above
   (e) Answers (b) and (c)
51. Consider the following table and indexes:

    CREATE TABLE tab(doc XML)

    CREATE INDEX idx1 ON tab(doc) GENERATE KEY USING
      XMLPATTERN '/product/id' AS SQL DECFLOAT

    CREATE INDEX idx2 ON tab(doc) GENERATE KEY USING
      XMLPATTERN '/product/name' AS SQL VARCHAR(40)

Which of the two indexes can DB2 use to evaluate the following query?

    SELECT doc FROM tab
    WHERE XMLEXISTS('$p/product[id = 178] and $p/product[name = "T42p"]'
                    PASSING doc AS "p")

(a) None, because a Boolean expression in XMLEXISTS never returns an empty sequence.
(b) Only idx1.
(c) Only idx2.
(d) Both.
(e) Both, but only if the document has been validated against a schema.

52. Given the query and the indexes shown in the following, which indexes can the query use?

    SELECT XMLQUERY('$i/book[@id = 101]/title' PASSING bookinfo AS "i")
    FROM books

    CREATE INDEX idx1 ON books(bookinfo) GENERATE KEY USING
      XMLPATTERN '/book/title' AS SQL VARCHAR(50)

    CREATE INDEX idx2 ON books(bookinfo) GENERATE KEY USING
      XMLPATTERN '/book/@id' AS SQL DOUBLE

(a) idx1.
(b) idx2.
(c) Neither, because XML predicates in the XMLQUERY function in the SELECT clause do not eliminate any rows from the result set and therefore cannot use an index.
(d) Both idx1 and idx2.
(e) Both, but only if the document has been validated.

23.10 XML PERFORMANCE AND MONITORING

53. In DB2 for Linux, UNIX, and Windows, what are the three access plan operators for XML processing? There are three correct answers.
(a) XMLSCAN
(b) XISCAN
(c) XSCAN
(d) XJOIN
(e) XANDOR

54. In DB2 for z/OS there are four new access plan operators for XML processing. Which of the following is NOT one of them?
(a) XIXSCAN
(b) XSCAN
(c) XIXOR
(d) DIXSCAN
(e) XIXAND

55. Does the RUNSTATS command collect statistics for XML columns?
(a) No.
(b) Yes. However, the optimizer does not yet use these statistics.
(c) Yes, but only for XML data that has been validated against XML Schemas.
(d) Yes. The RUNSTATS command collects statistics for XML data, but not for XML indexes.
(e) Yes. The RUNSTATS command collects statistics for XML data and for XML indexes.
56. In DB2 for Linux, UNIX, and Windows, what is XANDOR?
(a) A Boolean operator in the XQuery language standard.
(b) A new query operator that is only used for joins between relational and XML data.
(c) An SQL/XML function to express XQuery joins.
(d) A new query operator over XML indexes, used if the query has two or more equality predicates.
(e) A built-in function for XML parsing.

57. In DB2 for Linux, UNIX, and Windows, what does the XSCAN operator do?
(a) XSCAN is the same as TBSCAN, but for XML data.
(b) XSCAN, or cross-scan, is used to compute a join between two XML documents.
(c) XSCAN navigates XML documents, evaluates predicates, and extracts XML pieces if needed.
(d) XSCAN scans an XML index to evaluate a predicate.
(e) XSCAN scans an XML document and shreds it into relational tables.

58. In DB2 for z/OS, what is DIXSCAN?
(a) Access to an XML index of type DECFLOAT.
(b) XML index access that returns the DOCID and NODEID pairs for a given key value.
(c) Directed access to an XML index that was not defined with ALLOW REVERSE SCANS.
(d) It represents a scan of XML documents.
(e) An operator for DOCID index access that returns a RID for a given DOCID.

23.11 MANAGING XML DATA WITH NAMESPACES

59. In XML documents, XML namespaces are declared with which reserved attribute?
(a) xmlnamespace
(b) declare
(c) nsxml
(d) default
(e) xmlns
60. Which of the following is an invalid URI?
(a) http://www.DB2pureXMLCookbook.org/
(b) ftp://ftp.is.co.za/rfc/rfc3986.txt
(c) urn:xmlns:bogus:partner1.0
(d) telnet://192.0.2.16:80/
(e) http://www.DB2 pureXML Cookbook.org/

61. Which statement about default namespaces is correct?
(a) A default namespace always applies to all XML elements, regardless of actual namespace declarations in an XML document.
(b) The default namespace does not assign a namespace prefix to a URI.
(c) There can be at most one default namespace in an XML document.
(d) A default namespace guarantees that all elements and attributes in a document are in the same namespace, even if some elements have prefixes.
(e) A default namespace only applies if documents have been validated against an XML Schema that defines the default namespace.

62. Consider the following XML document:

    <p:product xmlns:p="http://myuri"><p:name>p595</p:name></p:product>

How can you declare a namespace in an XML query to retrieve information from this document?
(a) declare namespace xyz="http://myuri"
(b) declare default namespace "http://myuri"
(c) declare default element namespace "http://myuri"
(d) declare namespace p="http://anyuri"
(e) Both (a) and (c) are correct.

23.12 XML SCHEMAS AND VALIDATION

63. Do you need an XML Schema to store XML documents in a DB2 XML column?
(a) Yes, so that DB2 knows how to store the XML documents efficiently.
(b) No. XML Schemas are optional.
(c) Only if you want to use XQuery on your XML data.
(d) No, but there will be no query optimization without a schema.
(e) Both (b) and (d) are correct.
64. Is a valid document always also a well-formed document?
(a) Always
(b) Never
(c) Depends on the contents of the XML Schema
(d) Depends on the contents of the XML document
(e) Depends on the setting of the DB2 registry variable DB2_VALID_XML

65. An XML Schema can consist of how many separate schema documents?
(a) There can only be 1 schema document per XML Schema.
(b) At most two: a primary and a secondary schema document.
(c) 32.
(d) 256.
(e) An XML Schema can consist of an arbitrary number of schema documents.

66. Which command do you use to add an XML Schema to the XML Schema Repository?
(a) REGISTER XSROBJECT
(b) INSERT XMLSCHEMA
(c) ADD XMLSCHEMA
(d) REGISTER XMLSCHEMA
(e) CREATE XMLSCHEMA

67. In DB2 for z/OS, which of the following is true about the relational schema name in the SQL identifier of an XML Schema?
(a) It has to be either omitted or be the value SYSIBM.
(b) It can be the name of any defined schema in DB2.
(c) It can be any name you have previously registered in the XML Schema Repository.
(d) It has to be either omitted or be the value SYSXSR.
(e) It has to be the user id of the user performing the registration.

68. Is it possible to store XML documents for multiple XML Schemas in the same XML column?
(a) Yes, but only if the documents are inserted without validation.
(b) Yes, but only if the documents are inserted with validation.
(c) Yes, regardless of validation.
(d) No, documents for different XML schemas have to be stored in separate XML columns.
(e) Only if the different XML Schemas are backwards compatible.
69. What do the functions XMLVALIDATE and DSN_XMLVALIDATE return if an XML document is valid against the specified XML schema?
(a) 0
(b) 1
(c) true
(d) The DOCID of the document
(e) The document itself

70. What do the functions XMLVALIDATE and DSN_XMLVALIDATE return if an XML document is NOT valid against the specified XML schema?
(a) 0
(b) –1
(c) false
(d) An error
(e) A warning

23.13 PERFORMING FULL-TEXT SEARCH

71. If you want to issue DB2 Net Search Extender commands, you need to prefix them with which of the following?
(a) db2nse
(b) db2ns
(c) db2txt
(d) db2text
(e) db2ts

72. Which of the following is true about the CREATE INDEX command in Net Search Extender indexes?
(a) As soon as the command completes any query can use the index.
(b) In the CREATE INDEX command you can specify whether the index will be maintained synchronously or asynchronously.
(c) The CREATE INDEX command builds the index in the location defined by the DB2 registry variable DB2_NSE_PATH.
(d) The CREATE INDEX command can be issued in the DB2 Command Line Processor or via JDBC calls.
(e) After the CREATE INDEX command you need to issue the UPDATE INDEX command before queries can use the index.
73. Can you enable the DB2 Net Search Extender and DB2 Text Search in the same database?
(a) Yes
(b) No
(c) Depends on the DB2 registry setting DB2_DUAL_TEXT_SEARCH
(d) Yes, but a search query can only use one or the other, not both at the same time
(e) Yes, but only in UNICODE databases

74. When you use DB2 Text Search, how many text indexes are allowed per column?
(a) 64
(b) 1
(c) As many as you like (within storage limits)
(d) Depends on the text search parameter DB2TS_MAX_INDEXES
(e) 2—one for values and one for structural information

23.14 XML APPLICATION DEVELOPMENT

75. Which of the following statements is correct?
(a) DB2 pureXML allows you to reduce or completely avoid XML parsing in your application and reduces application complexity.
(b) DB2 pureXML allows you to reduce or completely avoid XML parsing in your application at the price of greater coding complexity.
(c) DB2 pureXML allows you to reduce your application complexity by introducing additional XML parsing in the application layer.
(d) DB2 pureXML improves application performance because it never performs XML parsing.
(e) DB2 pureXML applications manipulate XML data in the same way as an application that stores XML in CLOB columns.

76. In which language can you NOT use host variables of type XML?
(a) C, C++
(b) COBOL
(c) PL/1
(d) Fortran
(e) Assembler
77. JDBC 4.0 introduces a new application data type to handle XML data. What is the name of this data type?
(a) XML
(b) SQLXML
(c) XMLDOCUMENT
(d) XMLFILE
(e) DOM

78. In DB2 for Linux, UNIX, and Windows, you can use the XML data type for which of the following? There are two correct answers.
(a) Parameters in SQL stored procedures, but not for variables
(b) Parameters and variables in SQL stored procedures
(c) Parameters and variables in SQL user-defined functions, but not as a return type
(d) Parameters, variables, and return type of SQL user-defined functions
(e) Parameters and variables in SQL user-defined scalar or table functions, and return type of SQL user-defined scalar functions, but not as return type of a table function

79. What is the internal encoding of an XML document?
(a) The internal encoding is always UTF-8.
(b) The internal encoding is always UTF-16.
(c) The internal encoding is the same as the application code page when XML data is held in character type variable in the application.
(d) The internal encoding is the same as the database codepage.
(e) The internal encoding is determined by a Unicode Byte-Order Mark or an XML declaration with encoding attribute.
80. Which statements are true about an XML declaration such as the following? There are two correct answers.

    <?xml version="1.0" encoding="UTF-8" ?>

(a) The XML declaration is optional and not required for an XML document to be well-formed.
(b) An XML declaration must always contain an encoding attribute.
(c) An XML declaration is not stored as part of a document, but can be generated when XML data is retrieved by an application.
(d) The XML declaration of a document must match the XML declaration of its XML Schema.
(e) The XML declaration must be a separate line at the beginning of the document; that is, it must end with a new-line character.

81. Which DB2-specific JDBC methods allow you to specify a target encoding for the XML data you retrieve from DB2? There are two correct answers.
(a) DB2Xml.getDB2XmlBinaryStream()
(b) DB2Xml.getDB2XmlCharacterStream()
(c) DB2Xml.getDB2BinaryStream()
(d) DB2Xml.getDB2CharacterStream()
(e) DB2Xml.getDB2XmlBytes()

82. Which of the following is a valid declaration of an XML host variable?
(a) SQL TYPE IS XML AS CLOB(n) <hostvar_name>
(b) SQL TYPE IS XML AS BLOB(n) <hostvar_name>
(c) SQL TYPE IS XML AS CLOB_FILE <hostvar_name>
(d) SQL TYPE IS XML AS DBCLOB_FILE <hostvar_name>
(e) All of the above
23.15 ANSWERS

Question  Answer     Chapter/Section in this Book with Further Information
1         (e)        1.1, Anatomy of an XML Document
2         (a)        2.3, Choosing the Right Document Granularity
3         (c)        2.1, Choosing Between XML Elements and XML Attributes
4         (c)        2.2, XML Tags Versus Values
5         (c)        3.3.1, Storage Objects for XML Data
6         (c)        3.1, Understanding XML Document Trees
7         (a)        3.3.2, Defining Columns, Tables, and Table Spaces for XML Data
8         (c)        3.3.2, Defining Columns, Tables, and Table Spaces for XML Data
9         (b)        3.11, XML Storage in DB2 for z/OS
10        (d)        3.11, XML Storage in DB2 for z/OS
11        (e)        3.3.1, Storage Objects for XML Data
12        (d)        3.4, Using XML Base Table Row Storage (Inlining)
13        (e)        3.11, XML Storage in DB2 for z/OS
14        (b)        3.7, Reorganizing XML Data and Indexes
15        (a)        3.9, XML in Range-Partitioned Tables and MDC Tables
16        (b)        3.11, XML Storage in DB2 for z/OS
17        (d)(e)     3.11, XML Storage in DB2 for z/OS
18        (e)        4.7, Understanding XML Whitespace and Document Storage
19        (a)(e)     4.6, Dealing with XML Special Characters
20        (c)        4, Inserting and Retrieving XML Data and 20, Understanding XML Data Encoding
21        (b)        4.7.1, Preserving XML Whitespace
22        (a)(c)     5.5, Loading XML Data in DB2 for z/OS
23        (d)        5.7, Splitting Large XML Documents into Smaller Documents
24        (b)(c)     5.1, Exporting XML Data in DB2 for Linux, UNIX, and Windows
25        (a)        6.5, How to Execute XPath in DB2
26        (e)        7, Querying XML Data with SQL/XML
27        (d)        7.5, Common Mistakes with SQL/XML Predicates
28        (d)        6.7, XPath Predicates
29        (b)        7, Querying XML Data with SQL/XML
30        (a)(b)     6.5, How to Execute XPath in DB2
31        (c)(e)     8.1, XQuery Overview
32        (d)        6.14, General and Value Comparisons
33        (c)        7, Querying XML Data with SQL/XML
34        (e)        10.1, SQL/XML Publishing Functions
35        (c)        10.1.12, Legacy Functions
36        (c)        10.1, SQL/XML Publishing Functions
37        (e)        10.2, Using XQuery Constructors with Relational Input
38        (b)        11, Converting XML to Relational Data
39        (d)        11.2, Shredding with the XMLTABLE Function
40        (c)        11.3, Shredding with Annotated XML Schemas
41        (d)        11.3, Shredding with Annotated XML Schemas
42        (a)        11.2.2, Relational Views over XML Data
43        (b)(e)     12.2, Modifying Documents with XQuery Updates
44        (e)        12.7, Inserting XML Nodes into a Document
45        (a)        12.14, Transforming XML Documents with XSLT
46        (c)        13.1, Defining XML Indexes
47        (b)        13.2, XML Index Data Types
48        (d)        13.2, XML Index Data Types
49        (c)        13.2, XML Index Data Types
50        (b)        13.2, XML Index Data Types
51        (a)        13, Defining and Using XML Indexes
52        (c)        13.6.1, Special Cases with XMLQUERY
53        (b)(c)(e)  14.1.4, Access Plan Operators
54        (b)        14.1.4, Access Plan Operators
55        (e)        14.3, Statistics Collection for XML Data
56        (d)        14.1.4, Access Plan Operators
57        (c)        14.1.4, Access Plan Operators
58        (e)        14.1.4, Access Plan Operators
59        (e)        15.1.1, Namespace Declarations in XML Documents
60        (e)        15.1, Introduction to XML Namespaces
61        (b)        15.1.2, Default Namespaces
62        (e)        15.3, Querying XML Data with Namespaces
63        (b)        16.1, Introduction to XML Schemas and Their Usage
64        (a)        16.1.1, Valid Versus Well-Formed XML Documents
65        (e)        16.1.1, Valid Versus Well-Formed XML Documents
66        (d)        16.4, Registering XML Schemas
67        (d)        16.4.1, Registering XML Schemas in the DB2 Command Line Processor
68        (c)        17, Validating XML Documents against XML Schemas
69        (e)        17, Validating XML Documents against XML Schemas
70        (d)        17, Validating XML Documents against XML Schemas
71        (d)        19, Performing Full-Text Search
72        (b)        19.4, Managing Full-Text Indexes with the DB2 Net Search Extender
73        (b)        19, Performing Full-Text Search
74        (b)        19.6.2, Creating and Maintaining Full-Text Indexes for DB2 Text Search
75        (a)        21, Developing XML Applications with DB2
76        (d)        21, Developing XML Applications with DB2
77        (b)        21, Developing XML Applications with DB2
78        (b)(d)     18, Using XML in Stored Procedures, UDFs, and Triggers
79        (e)        20.1, Understanding Internal and External XML Encoding
80        (a)(c)     20.1, Understanding Internal and External XML Encoding
81        (a)(e)     21, Developing XML Applications with DB2
82        (e)        21, Developing XML Applications with DB2
Appendix A Getting Started with DB2 pureXML

This appendix explains how to explore XML data in DB2 and how to run basic commands in the DB2 Command Line Processor (CLP) and SPUFI. Note that the CLP is not only available for DB2 for Linux, UNIX, and Windows but also for DB2 for z/OS as an application that requires Unix System Services.

A.1 EXPLORING THE STRUCTURE OF XML DOCUMENTS

Before you can start writing XML queries or updates you need to know the structure of the XML documents. A good approach is to look at one or several representative sample documents. You can use the DB2 Control Center, IBM Data Studio, or commands issued in the CLP or SPUFI to view the structure of XML documents.

A.1.1 Exploring XML Documents in the DB2 Control Center

Figure A.1 shows the DB2 Control Center view of the first XML document in the info column of the customer table. The Source View and Tree View tabs let you switch between a textual and a hierarchical view of the XML document.
Figure A.1 Viewing XML documents in the DB2 Control Center

Similar capabilities for exploring XML documents are available in IBM Data Studio Developer. For details and a screenshot, see Section 21.9.1.

A.1.2 Exploring XML Documents in the CLP

To explore XML data in a command-line interface, issue a query that selects one or several rows from an XML column, as shown in Figure A.2. The FETCH FIRST 1 ROWS ONLY option can be used to conveniently limit the output. The SQL statement in Figure A.2 can be issued from the CLP or from SPUFI. Unless you explicitly request DB2 to preserve whitespace, XML documents are stored without line breaks. Hence, each XML document that you retrieve can be a single wrapping line.

    SELECT info FROM customer FETCH FIRST 1 ROWS ONLY;

    <customerinfo Cid="1000"><name>Kathy Smith</name><addr country="
    Canada"><street>5 Rosewood</street><city>Toronto</city><prov-sta
    te>Ontario</prov-state><pcode-zip>M6W 1E6</pcode-zip></addr><pho
    ne type="work">416-555-1358</phone></customerinfo>

Let’s add line breaks and indentation to the above to make it easier to read:

Figure A.2 Selecting one XML document from a table
    <customerinfo Cid="1000">
      <name>Kathy Smith</name>
      <addr country="Canada">
        <street>5 Rosewood</street>
        <city>Toronto</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>M6W 1E6</pcode-zip>
      </addr>
      <phone type="work">416-555-1358</phone>
    </customerinfo>

Figure A.2 Selecting one XML document from a table (continued)

You can see that the root element is customerinfo and has an attribute called Cid. The child elements of customerinfo are the name, addr, and phone elements. Under the addr element are the street, city, prov-state, and pcode-zip elements. The phone element has an attribute called type.

A.1.3 Exploring XML Documents in SPUFI

For DB2 for z/OS you might prefer to use SPUFI rather than the CLP. If you use SPUFI to retrieve and examine XML documents you might need to change the MAX CHAR FIELD setting in the SPUFI defaults panel to be larger than the character length of the XML document. In Figure A.3 the value has been set to 2000. (In this figure, the right side of the SPUFI output has been truncated to fit the page.)

    Output format characteristics:
    14 MAX NUMERIC FIELD ===> 33      (Maximum width for numeric fi
    15 MAX CHAR FIELD .. ===> 2000    (Maximum width for character
    16 COLUMN HEADING .. ===> NAMES   (NAMES, LABELS, ANY or BOTH)

Figure A.3 Changing the SPUFI settings for MAX CHAR FIELD

XML and XPath are case-sensitive, so you need to ensure that CAPS are turned off. Also ensure that the terminal session CCSID setting is consistent with the application encoding scheme, because “[” and “]” have different code points in different code pages. If the CCSID settings are not consistent, then the query in Figure A.4 fails with error SQLCODE 16002.

    SELECT cid, info
    FROM customer
    WHERE XMLEXISTS('$i/customerinfo/addr[city="Toronto"]' PASSING info AS "i")

Figure A.4 Sample query on DB2 for z/OS
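Once the document structure is known, the XMLEXISTS predicate from Figure A.4 can be combined with XMLQUERY to return only one part of each matching document. The following is a minimal sketch rather than one of the book's figures. It assumes that the customer table holds the sample documents listed in Appendix B, and the column alias custname is made up for illustration.

```sql
-- Sketch: return only the name element of customers located in Toronto.
-- Assumes the customer table is populated with the sample documents from
-- Appendix B; the alias "custname" is illustrative only.
SELECT cid,
       XMLQUERY('$i/customerinfo/name' PASSING info AS "i") AS custname
FROM customer
WHERE XMLEXISTS('$i/customerinfo/addr[city="Toronto"]' PASSING info AS "i")
```

With those six sample documents, three have a city of Toronto (Cid 1000, 1004, and 1005), so each result row should carry a name element such as <name>Kathy Smith</name> instead of the full document.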
A.2 TIPS FOR RUNNING XML OPERATIONS IN THE CLP

The Command Line Processor (CLP) in DB2 for Linux, UNIX, and Windows offers various options that are useful when you run XQuery, SQL/XML, or INSERT and UPDATE statements with XML data. These options are listed in Table A.1.

Table A.1 CLP Options That Are Useful for XML

Option  Purpose
-i      Displays the XML results produced by an XQuery with indentation and line breaks for better readability (pretty print). Without this option, XML data is returned as a continuous string without line breaks. This option only works for XQuery, not for SQL/XML.
-d      Generates an XML declaration at the beginning of every XML document or XML value that is returned. Without -d, XML declarations are omitted.
-q      Preserves all whitespace in any command that is executed. Without the -q option, the DB2 CLP strips newline characters before sending your command to the DB2 server. This option matters when you insert XML documents through the CLP and want to preserve whitespace (see Chapter 4 for details).
-td#    The -td option defines a character as the statement terminator. In this case the statement termination character is set to #. If just the -t option is used, the default statement termination character is the semicolon (;), which can conflict with namespace declarations or statements in CREATE PROCEDURE, CREATE FUNCTION, or CREATE TRIGGER statements.

Here are three examples of invoking the CLP from an operating system prompt:

• db2 -i -t invokes the CLP with pretty print for XQuery results and using the semicolon as the statement terminator.
• db2 -d -td% invokes the CLP with the percent sign as the statement terminator and enables XML declarations in XML query results.
• db2 -q -i -td# invokes the CLP with whitespace preservation, pretty print for XQuery results, and the # character as the statement terminator.
On Windows, the CLP needs to be invoked from an operating system prompt in a DB2 command window. In Figure A.5, a CLP session is invoked with db2 -i -td#. Since a termination character is used, you can enter multiline commands at the db2=> prompt. For example, you can cut and paste a multiline statement from a text file into the CLP. To complete and submit the command, type the termination character, which is # in the example in Figure A.5, and press Enter.
Figure A.5 Using the DB2 CLP with a non-default termination character

Figure A.6 shows how output of the command in Figure A.5 is returned with pretty print:

    <customerinfo Cid="1002">
      <name> Jim Noodle </name>
      <addr country="Canada">
        <street> 25 EastCreek </street>
        <city> Markham </city>
        <prov-state> Ontario </prov-state>
        <pcode-zip> N9C 3T6 </pcode-zip>
      </addr>
      <phone type="work"> 905-555-7258 </phone>
    </customerinfo>

      1 record(s) selected.

Figure A.6 CLP output from an XQuery when the -i option is used
Note that the CLP returns XML data as a 4KB character column. Documents larger than 4KB are truncated. Use the DB2 EXPORT command if you want to retrieve full documents larger than 4KB through the CLP (see Chapter 5, Moving XML Data).

If you have DB2 commands or SQL statements in a text file, you can execute these commands and statements by providing the text file as an input parameter to the CLP (-f). Figure A.7 shows two examples. The first line executes the commands in the file Q2.txt and assumes that each statement in the file is terminated with the semicolon as the default termination character (-t). The -v option produces verbose output. The second line executes the commands in the file Q3.sql and expects these commands to end with the # character.

    db2 -t -v -f Q2.txt
    db2 -td# -f Q3.sql

Figure A.7 Executing the commands in the files Q2.txt and Q3.sql
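To make the role of the termination character concrete, here is a sketch of what a script file such as Q3.sql might contain. The book does not show the file's contents, so the statements are assumptions: the namespace URI is a placeholder, the sample customer documents do not use it, and the script presumes that a database connection already exists.

```sql
-- Hypothetical contents of Q3.sql, to be run with: db2 -td# -f Q3.sql
-- A database connection is assumed to exist already.

-- The namespace declaration ends with a semicolon at the end of a line.
-- With # as the terminator, that semicolon is just part of the statement.
-- The URI is a placeholder; the sample documents have no namespace,
-- so this query is expected to return no rows against the sample data.
SELECT cid
FROM customer
WHERE XMLEXISTS('declare default element namespace "http://example.org/customer";
                 $i/customerinfo/addr[city = "Toronto"]'
                PASSING info AS "i")#

-- A second statement in the same file, also terminated with #.
SELECT poid, status
FROM purchaseorder
WHERE status = 'Shipped'#
```

Choosing a terminator that never occurs in your statements, such as # or %, avoids the conflict that Table A.1 describes for the default semicolon.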
Appendix B The XML Sample Database

Throughout this book we often use the XML sample database that comes with DB2 for Linux, UNIX, and Windows. This appendix describes how to create the sample database, shows some of its content, and explains how to set up sample tables in DB2 for z/OS.

B.1 XML SAMPLE DATABASE ON DB2 FOR LINUX, UNIX, AND WINDOWS

Issue the following command at the OS prompt to create the sample database with the database name sampxml:

    db2sampl -name sampxml -xml

Without the -name option the default database name is sample. The -xml flag is required to create tables with XML data in the sample database. The relational database schema used for these tables is the user ID of the person who issued the db2sampl command.

In the examples in this book we use the tables customer, purchaseorder, and product. The columns in these tables are shown in Table B.1. The XML column history in the customer table does not contain any data when the sample database is initially created.
Table B.1 Sample Database Tables on DB2 for Linux, UNIX, and Windows

Table Name      Column Name   Column Type
CUSTOMER        CID           BIGINT
                INFO          XML
                HISTORY       XML
PURCHASEORDER   POID          BIGINT
                STATUS        VARCHAR(10)
                CUSTID        BIGINT
                ORDERDATE     DATE
                PORDER        XML
                COMMENTS      VARCHAR(1000)
PRODUCT         PID           VARCHAR(10)
                NAME          VARCHAR(128)
                PRICE         DECIMAL(30,2)
                PROMOPRICE    DECIMAL(30,2)
                PROMOSTART    DATE
                PROMOEND      DATE
                DESCRIPTION   XML

Since DB2 9.1 Fixpack 7 and DB2 9.5 Fixpack 4, the XML data in the customer, product, and purchaseorder tables of the sample database no longer contains namespaces. This makes it easier to get started with querying, updating, and indexing XML data. The suppliers table in the sample database still contains namespaces.

B.2 XML SAMPLE TABLES ON DB2 FOR Z/OS

In DB2 for z/OS, the installation job DSNTEJ1 creates five tables with XML columns. These tables are in the relational schema DSN8910 and are named PRODUCT, CUSTOMER, PURCHASEORDER, CATALOG, and SUPPLIERS. These tables are not populated by the installation job.

There are several ways to populate some of these tables. For example, if you have a DB2 for Linux, UNIX, and Windows installation, such as the free DB2 Express-C, you can create the sample database and select or export the data from there. The data can then be imported or inserted into the z/OS tables using SPUFI or an import job.

The PDF document “DB2 Version 9.1 for z/OS XML Guide” (SC18-9858) provides the DDL and three INSERT statements with XML data for a table called MYCUSTOMER. You can copy and paste these statements into SPUFI to build a sample table to work with.

B.3 TABLE CUSTOMER—COLUMN INFO

The customer table contains the following six documents.
<customerinfo Cid="1000">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>5 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type="work">416-555-1358</phone>
</customerinfo>

<customerinfo Cid="1001">
  <name>Kathy Smith</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

<customerinfo Cid="1002">
  <name>Jim Noodle</name>
  <addr country="Canada">
    <street>25 EastCreek</street>
    <city>Markham</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N9C 3T6</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
</customerinfo>

<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>Aurora</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>N8X 7F8</pcode-zip>
  </addr>
  <phone type="work">905-555-7258</phone>
  <phone type="home">416-555-2937</phone>
  <phone type="cell">905-555-8743</phone>
  <phone type="cottage">613-555-3278</phone>
</customerinfo>

<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada">
    <street>1596 Baseline</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M3Z 5H9</pcode-zip>
  </addr>
  <phone type="work">905-555-4789</phone>
  <phone type="home">416-555-3376</phone>
  <assistant>
    <name>Gopher Runner</name>
    <phone type="home">416-555-3426</phone>
  </assistant>
</customerinfo>

<customerinfo Cid="1005">
  <name>Larry Menard</name>
  <addr country="Canada">
    <street>223 NatureValley Road</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M4C 5K8</pcode-zip>
  </addr>
  <phone type="work">905-555-9146</phone>
  <phone type="home">416-555-6121</phone>
  <assistant>
    <name>Goose Defender</name>
    <phone type="home">416-555-1943</phone>
  </assistant>
</customerinfo>

B.4 TABLE PRODUCT—COLUMN DESCRIPTION

The product table contains the following four documents.

<product pid="100-100-01">
  <description>
    <name>Snow Shovel, Basic 22 inch</name>
    <details>Basic Snow Shovel, 22 inches wide, straight handle with D-Grip</details>
    <price>9.99</price>
    <weight>1 kg</weight>
  </description>
</product>

<product pid="100-101-01">
  <description>
    <name>Snow Shovel, Deluxe 24 inch</name>
    <details>A Deluxe Snow Shovel, 24 inches wide, ergonomic curved handle with D-Grip</details>
    <price>19.99</price>
    <weight>2 kg</weight>
  </description>
</product>

<product pid="100-103-01">
  <description>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <details>Super Deluxe Snow Shovel, 26 inches wide, ergonomic battery heated curved handle with upgraded D-Grip</details>
    <price>49.99</price>
    <weight>3 kg</weight>
  </description>
</product>

<product pid="100-201-01">
  <description>
    <name>Ice Scraper, Windshield 4 inch</name>
    <details>Basic Ice Scraper 4 inches wide, foam handle</details>
    <price>3.99</price>
  </description>
</product>

B.5 TABLE PURCHASEORDER—COLUMN PORDER

The purchaseorder table contains the following six documents.

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>

<PurchaseOrder PoNum="5001" OrderDate="2005-02-03" Status="Shipped">
  <item>
    <partid>100-101-01</partid>
    <name>Snow Shovel, Deluxe 24 inch</name>
    <quantity>1</quantity>
    <price>19.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>2</quantity>
    <price>49.99</price>
  </item>
  <item>
    <partid>100-201-01</partid>
    <name>Ice Scraper, Windshield 4 inch</name>
    <quantity>1</quantity>
    <price>3.99</price>
  </item>
</PurchaseOrder>

<PurchaseOrder PoNum="5002" OrderDate="2004-02-29" Status="Shipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-101-01</partid>
    <name>Snow Shovel, Deluxe 24 inch</name>
    <quantity>5</quantity>
    <price>19.99</price>
  </item>
  <item>
    <partid>100-201-01</partid>
    <name>Ice Scraper, Windshield 4 inch</name>
    <quantity>5</quantity>
    <price>3.99</price>
  </item>
</PurchaseOrder>

<PurchaseOrder PoNum="5003" OrderDate="2005-02-28" Status="UnShipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>1</quantity>
    <price>9.99</price>
  </item>
</PurchaseOrder>

<PurchaseOrder PoNum="5004" OrderDate="2005-11-18" Status="Shipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>4</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>2</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>

<PurchaseOrder PoNum="5006" OrderDate="2006-03-01" Status="Shipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-101-01</partid>
    <name>Snow Shovel, Deluxe 24 inch</name>
    <quantity>5</quantity>
    <price>19.99</price>
  </item>
  <item>
    <partid>100-201-01</partid>
    <name>Ice Scraper, Windshield 4 inch</name>
    <quantity>5</quantity>
    <price>3.99</price>
  </item>
</PurchaseOrder>
Appendix C Further Reading

This appendix contains links to useful resources, grouped by chapter.

C.1 GENERAL RESOURCES FOR ALL CHAPTERS

The DB2 9.5 and 9.7 for Linux, UNIX, and Windows Information Centers:
• http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp
• http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp
The DB2 9 for z/OS Information Center:
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp
If any pureXML question is not answered in this book, the fastest way to get an answer is to post a question in the DB2 pureXML forum:
• http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423
The IBM developerWorks Wiki for DB2 XML has a variety of technical articles on DB2 pureXML:
• http://www.ibm.com/developerworks/wikis/display/db2xml/Technical+Papers+and+Articles
Download DB2 Express-C, which is free to use, deploy, and distribute:
• http://www.ibm.com/software/data/db2/express/download.html
C.2 CHAPTER-SPECIFIC RESOURCES

Chapter 1: Introduction
New to DB2? Start here!
• http://www-128.ibm.com/developerworks/wikis/pages/viewpage.action?pageId=2658
New to XML? Start here!
• http://www.ibm.com/developerworks/xml/newto/
• http://www.w3schools.com/
Good books about XML in general include the following:
• XML in a Nutshell, 3rd Edition, by Elliotte Rusty Harold and Scott Means (O’Reilly, ISBN 0-596-00764-7)
• Beginning XML, 4th Edition, by David Hunter et al. (Wrox, ISBN 0-470-11487-8)
• Professional XML, by Bill Evjen et al. (Wrox, ISBN 0-471-77777-3)

Chapter 2: Designing XML Data and Applications
Three very interesting blog entries discuss the problems with Name/Value pairs in data modeling. The authors look at it from a relational point of view, but similar problems apply to Name/Value pairs in XML notation.
• http://decipherinfosys.wordpress.com/2007/01/29/name-value-pair-design/
• http://geekswithblogs.net/darrengosbell/articles/KVPsInDatabaseDesign.aspx
• http://www.ibridge.be/?p=15
Further discussion of the design question “elements versus attributes” can be found in the developerWorks article “Principles of XML design: When to use elements versus attributes” ...
• http://www.ibm.com/developerworks/xml/library/x-eleatt.html
... and in a section of the w3schools.com website:
• http://www.w3schools.com/DTD/dtd_el_vs_attr.asp

Chapter 3: Designing and Managing XML Storage Objects
These “best practices” articles provide excellent guidelines for database storage, range partitioning, multidimensional clustering, and other physical database design topics in DB2 for Linux, UNIX, and Windows:
• “Best Practices—Database Storage”
• “Best Practices—Physical Database Design”
• “Best Practices—Data Life Cycle Management”
They are available at http://www.ibm.com/developerworks/data/bestpractices/.
For deeper information on the pureXML implementation in DB2 for z/OS, read this paper by Guogen Zhang:
• http://www.geocities.com/zhanggene/pub/ScalableNativeXMLDB.pdf

Chapter 4: Inserting and Retrieving XML Data
The DB2 for Linux, UNIX, and Windows Exchange is a place where users and IBMers share code samples, scripts, examples, and other goodies. Here is where you can get the UDFs that are explained in Chapter 4.
• http://www.ibm.com/developerworks/exchange/dw_categoryView.jspa?categoryID=974&showAll=true
For deep details on reserved characters in XML, whitespace, attribute normalization, digital signatures, and more, try these links:
• http://www.w3.org/TR/REC-xml/#sec-white-space
• http://www.w3.org/TR/REC-xml/#sec-line-ends
• http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/

Chapter 5: Moving XML Data
The only place that has additional information on moving XML data into and out of DB2 databases is the DB2 Information Center:
• http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.xml.doc/doc/c0024120.html
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.ugref/db2z_loaddataxmlcolumns.htm
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.xml/db2z_xmlutil.htm

Chapters 6, 7, 8, and 9: Querying XML Data
If you are new to XPath and XQuery then the tutorials at w3schools.com are highly recommended for a quick introduction:
• http://www.w3schools.com/xpath/default.asp
• http://www.w3schools.com/xquery/default.asp
The ultimate book on XQuery by Don Chamberlin et al. is XQuery from the Experts: A Guide to the W3C XML Query Language:
• http://www.amazon.com/XQuery-Experts-Guide-Query-Language/dp/0321180607
Relational views over XML (using the XMLTABLE function) are helpful to “Create business reports for XML Data with Cognos 8 BI and DB2 pureXML”:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0811saracco/
The exercise “XQuery use cases converted to SQL/XML in DB2 9 for z/OS” shows examples of how easy it is to work around the lack of XQuery in DB2 for z/OS:
• http://www.ibm.com/developerworks/wikis/download/attachments/2500/XQueryUseCases.zip
The complete reference of all supported XPath and XQuery functions can be found in the DB2 for z/OS information center (search for “Descriptions of XPath functions”) and the DB2 for Linux, UNIX, and Windows information center (search for “Functions by category”):
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.xml/db2z_xpxqfunctionreference.htm
• http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.xml.doc/doc/xqrfncategory.html
The specifications of the XPath and XQuery standards are a tough read if you are new to querying XML data, but very valuable when you get more advanced and dig deeper into the details of the language:
• http://www.w3.org/TR/xpath20/
• http://www.w3.org/TR/xquery
The built-in functions in XPath and XQuery are defined and formally specified in:
• http://www.w3.org/TR/xquery-operators/

Chapter 10: Producing XML from Relational Data
If you wonder where the SQL/XML Publishing functions are covered in the DB2 documentation, take a look at the SQL Reference for DB2 for z/OS and DB2 for Linux, UNIX, and Windows, Version 8 or higher.
IBM Rational Data Architect provides a graphical mapping tool to construct XML from relational tables and generates SQL/XML publishing queries for you. Further details are available in an article and a tutorial:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0710kokkat/
• http://www.ibm.com/developerworks/edu/dm-dw-dm-0609bittner-i.html
Chapter 11: Converting XML to Relational Data
Mayank’s article, “From DAD to annotated XML schema decomposition,” helps you migrate from the XML Extender to the new shredding capabilities in DB2 9.x:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0604pradhan/
The article “Shred XML documents using DB2 pureXML” provides a useful comparison of shredding techniques with examples:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0801ledezma/
Interested in a case study on shredding XML with DB2? Take a look at this article:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-0804nicola/

Chapter 12: Updating and Transforming XML Documents
The implementation of XQuery Updates in DB2 for Linux, UNIX, and Windows is based on the following W3C specification of the XQuery Update Facility:
• http://www.w3.org/TR/2006/WD-xqupdate-20060711/
The formal documentation of the XQuery Update facility in DB2 starts here:
• http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.xml.doc/doc/xqrupdcontainer.html
An introduction to XSLT:
• http://www.w3schools.com/xsl/default.asp

Chapter 13: Defining and Using XML Indexes
The article “On the Path to Efficient XML Queries” describes in more detail how the semantics of the XQuery and SQL/XML languages affect the eligibility of XML indexes for XML queries:
• http://www.vldb.org/conf/2006/p1117-balmin.pdf

Chapter 14: XML Performance and Monitoring
Some of the most common questions about XML performance are answered in the article “A performance comparison of DB2 9 pureXML with CLOB and shredded XML storage”:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0612nicola/
XML Database Benchmark: Transaction Processing over XML (TPoX):
• http://tpox.sourceforge.net/
If your queries are included in a stored procedure, here is how to collect the access plan for a stored procedure in DB2 for Linux, UNIX, and Windows:
• http://www-01.ibm.com/support/docview.wss?uid=swg21279292
The white paper “DB2 9 and z/OS XML System Services Synergy Update” by Judy Ruby-Brown and Akiko Hoshikawa contains a lot of z/OS-specific performance and monitoring information for pureXML. Highly recommended for mainframe users:
• http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101227
More information on the IBM DB2 Optimization Service Center for DB2 for z/OS:
• http://www.ibm.com/software/data/db2/zos/downloads/osc.html

Chapter 15: Managing XML Data with Namespaces
There is nothing DB2-specific about managing XML data with namespaces. The better you understand namespaces in general, the easier it is to work with namespaces in DB2. The following are our Top 3 resources on XML namespaces:
• http://www.w3schools.com/XML/xml_namespaces.asp
• http://www.rpbourret.com/xml/NamespacesFAQ.htm
• http://www.w3.org/TR/REC-xml-names/

Chapter 16: Managing XML Schemas and Chapter 17: Validating XML Documents against XML Schemas
The XML Schema Primer. This document is intended to be an easily readable description of how XML Schemas work. It is less formal and less abstract than the official XML Schema specification:
• http://www.w3.org/TR/xmlschema-0/
An even simpler introduction is this tutorial:
• http://www.w3schools.com/schema/default.asp
An excellent collection of best practices for designing XML Schemas:
• http://www.xfront.com/BestPracticesHomepage.html
The complete and formal specification of the XML Schema language consists of two parts. The first part is on structures, the second part is on data types:
• http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/
If you like O’Reilly books, then their XML Schema book by Eric van der Vlist is a good choice to learn all about XML Schema (O’Reilly Media, ISBN: 0596002521):
• http://oreilly.com/catalog/9780596002527/
Definitive XML Schema is written by Priscilla Walmsley, an XML expert and member of the W3C XML Schema Working Group from 1999 to 2004 (Prentice Hall, ISBN: 0130655678):
• http://www.datypic.com/books/defxmlschema/
The XML Schema Companion by Neil Bradley is another very accessible XML Schema guide (Addison-Wesley, ISBN: 0321136179):
• http://www.amazon.com/XML-Schema-Companion-Neil-Bradley/dp/0321136179
Sample scripts and demonstrations of using industry standard XML Schemas with DB2, XQuery, web services, Atom feeds, and forms:
• http://www.alphaworks.ibm.com/tech/purexml/download
• http://www.ibm.com/developerworks/wikis/display/db2xml/IndustryFormatsAndServicesWithpureXML
The DB2 for z/OS XSR Setup and Troubleshooting Guide can be found on this page:
• http://www.ibm.com/developerworks/wikis/display/db2xml/DB2+for+zOS+pureXML

Chapter 18: Using XML in Stored Procedures, UDFs, and Triggers
The most complete reference on SQL stored procedures, user-defined functions, and triggers is the following book: DB2 SQL PL: Essential Guide for DB2 UDB on Linux, UNIX, Windows, i5/OS, and z/OS, 2nd Edition, IBM Press, ISBN 0-13-147700-5:
• http://www.ibmpressbooks.com/bookstore/product.asp?isbn=0131477005
This article provides fundamental performance guidelines for SQL stored procedures:
• http://www.ibm.com/developerworks/data/library/techarticle/0306arocena/0306arocena.html

Chapter 19: Performing Full-Text Search
Information about downloading and installing the DB2 Net Search Extender:
• http://www.ibm.com/software/data/db2/9/download.html
• ftp://ftp.software.ibm.com/ps/products/db2/info/vr95/pdf/en_US/cteu9e951.pdf
Tuning the Performance of Full-Text Indexing in the DB2 Net Search Extender:
• http://www.ibm.com/developerworks/wikis/download/attachments/1824/DB2+NSE+indexing+performance.pdf
Key Concepts of IBM OmniFind Text Search for DB2 for z/OS:
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.srchz/srchz_keyconcepts.htm
Documentation of the IBM OmniFind Text Search Server for DB2 for z/OS:
• http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db29.doc.srchz/dsntsk11.pdf

Chapter 20: Understanding XML Data Encoding
Useful information about Unicode:
• http://www.unicode.org/faq/, http://en.wikipedia.org/wiki/Unicode, and http://www.utf-8.com/
More detailed information on Unicode Byte Order Marks and the automatic detection of character encodings in XML documents:
• http://www.w3.org/TR/REC-xml/#sec-guessing
• http://en.wikipedia.org/wiki/Byte-order_mark
• http://unicode.org/faq/utf_bom.html#BOM

Chapter 21: Developing XML Applications with DB2
“DB2 Express-C: The Developer Handbook for XML, PHP, C/C++, Java, and .NET”:
• http://www.redbooks.ibm.com/abstracts/sg247301.html
The official Java documentation of the SQLXML interface in JDBC 4.0:
• http://java.sun.com/javase/6/docs/api/java/sql/SQLXML.html
Data bindings between XML and Java objects:
• http://www.ibm.com/developerworks/library/x-databdopt/index.html
Introduction to pureQuery:
• http://www.ibmdatabasemag.com/dbadmin/showArticle.jhtml?articleID=207801106
Handle pureXML data in Java applications with pureQuery:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-0901rodrigues/
DeveloperWorks article “Develop proof-of-concept .NET applications, Part 1: Create database objects in DB2 Viper using .NET”:
• http://www.ibm.com/developerworks/edu/dm-dw-dm-0605xia-i.html
“Develop proof-of-concept .NET applications, Part 4: Wire your application to DB2 pureXML data”:
• http://www.ibm.com/developerworks/edu/dm-dw-dm-0608xia-i.html
Build a DB2 pureXML application in a day:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0812malaika/
XML manipulation in COBOL 4.1 is documented in Chapters 28 and 29 of the COBOL Programming Guide (SC23-8529-00):
• http://publibfp.boulder.ibm.com/epubs/pdf/igy3pg40.pdf
All DB2 PHP functions in the extension ibm_db2 are documented here:
• http://www.php.net/manual/en/ref.ibm-db2.php
PHP 5.1.2 for the z/OS UNIX System Services platform:
• http://www.ibm.com/servers/eserver/zseries/zos/unix/ported/php/index.html
Resources for developing Perl applications with DB2:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-0512greenstein/
• http://www.ibm.com/software/data/db2/perl
• http://search.cpan.org/~ibmtordb2/
Resources for developing XML applications in Perl:
• http://www.ibm.com/developerworks/xml/library/x-xmlperl1.html
• http://www.ibm.com/developerworks/xml/library/x-xmlperl2.html
• http://www.ibm.com/developerworks/xml/library/x-xmlperl3.html
DB2 and pureXML with Ruby on Rails:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0706chun/index.html
• http://www.alphaworks.ibm.com/tech/db2onrails
• http://antoniocangiano.com/2008/02/08/essential-guide-to-the-ruby-driver-for-db2/

Chapter 22: Exploring XML Information in the DB2 Catalog
Refer to the DB2 Information Center for details on the DB2 catalog.
C.3 RESOURCES ON THE INTEGRATION OF DB2 PUREXML WITH OTHER PRODUCTS

IBM Data Studio Developer 2.1 is available from:
• http://www.ibm.com/software/data/studio/
• http://www.ibm.com/developerworks/spaces/datastudio
Lotus Forms, XForms, and DB2 pureXML:
• http://www.ibm.com/developerworks/wikis/download/attachments/1824/LotusFormsXFormsandDB2pureXML.pdf
WebSphere DataPower and DB2 pureXML:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0805malaika3/
Universal Services for pureXML using Data Web Services:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0805malaika/
Creating business reports for XML data with Cognos BI and DB2 pureXML:
• http://www.ibm.com/developerworks/db2/library/techarticle/dm-0811saracco/
Using industry standard data formats with WebSphere ESB and DB2 pureXML:
• http://www.ibm.com/developerworks/websphere/techjournal/0706_elhilaly/0706_elhilaly.html
DB2 pureXML and Altova XMLSPY:
• http://www.altova.com/whitepapers/ibm.pdf
DeveloperWorks article, “DB2 and Rational: Working together, Part 1: Introduction to DB2 development with Rational Application Developer”:
• http://www.ibm.com/developerworks/edu/dm-dw-dm-0512eaton-i.html
IBM InfoSphere Data Architect (previously known as Rational Data Architect):
• http://www.ibm.com/developerworks/downloads/r/rda/learn.html?S_TACT=105AGX28&S_CMP=DLMAIN
Index Symbols & (ampersand), escaping 88 * (asterisk) as wildcard character, 140, 594 @ (at sign) in XPath, 135 @* XPath wildcard, 140 , (comma) operator, construction of sequences, 154 $ (dollar sign) in XML column references, 161 XQuery variable names, 196 . (dot), current context in XPath, 151-153 // (double slash) in XPath, 141-142 in XPath predicates, 146 != (not equal) comparison operator, not( ) function versus, 150 % (percent sign) in wildcard searches, 583 .. (parent directory) in file system navigation, 133 .. (parent step) in XPath, 151-153 | (pipe) character union of sequences, 154 as XPath union operator, 585 ? (question mark) as wildcard character, 594 ; (semicolon) in namespace declarations, 448 in stored procedures, 549 ’ (single quotes), escaping, 571 / (slash) in file system navigation, 133 in XPath, 141 in XPath predicates, 145 _ (underscore character) in wildcard searches, 583 A abbreviated syntax in XPath, 157 access control, 9 access plans. See execution plans ADD XMLSCHEMA command, 485 727 adjust-date-to-timezone function (XQuery), 226 ADMIN_EST_INLINE_ LENGTH function, 45-47 ADMIN_IS_INLINED function, 44-45 ADO.NET data providers, list of, 631 aggregate functions, 278 aggregation. See also grouping XML construction with, 207-208 of XML data, 233-239 within and across documents, 236-237 XMLTABLE function, 234-236 with XMLAGG function (SQL/XML), 277-283 aggregation functions in XQuery, 218-220 ALTER INDEX command (DB2 Net Search Extender), 580 altering text indexes with DB2 Net Search Extender, 580 Altova XML tools, 656-658
728 ampersand (&), escaping 88 AND operator, 149 in full-text searches, 584 annotated schema shredding, 306-318 advantages/disadvantages of, 301 annotating XML Schema, 306-310 defining annotations in Data Studio Developer, 311 registering annotated schemas, 311-312 shredding multiple XML documents, 315-318 shredding single XML documents, 312-315 Annotated XSD Mapping Editor, 311 APAR II14426, xxvi APARs, list of, 72-73 APIs, 9 application code page, 599 application development, 609 CLI applications, 636-639 embedded SQL applications, 639-647 C applications with, 645-647 COBOL applications with, 640-642 PL/1 applications with, 643-644 for DB2 pureXML, 9 host variables, 613-614 Java applications, 615-631 JDBC 3.0, XML support in, 615-619 JDBC 4.0, example usage, 621-627 JDBC 4.0, XML support in, 619-621 pureQuery, 629-631 XML data binding, 629 XML documents, creating from application data, 627-628 Index .NET applications, 631-636 ADO.NET data providers, list of, 631 inserting XML data from, 635 manipulating XML data in, 633-635 querying XML data in, 632-633 XML Schema and DTD handling, 636 parameter markers, 613-614 Perl applications, 650-651 PHP applications, 647-649 pureXML, benefits of, 610-613 tools for Altova XML tools, 656-658 IBM Data Studio Developer, 652-653, 655 IBM Database Add-ins for Visual Studio, 656 list of, 651 <oXygen/>, 658-659 Stylus Studio, 659 application layer, avoiding parsing in, 610-612 application-centric validation, 545 applications (XML), best practices, 434-435 arithmetic expressions, 190 in XQuery, 212-214 asterisk (*) as wildcard character, 140, 594 atomic values (XQuery Data Model), 129 attaching partitions, 57 attribute axis, 157 attribute constructors (XQuery), 290-292 attribute expressions, XML construction with, 206 attribute nodes, 29, 129 attribute values versus, 136 attribute values, attribute nodes versus, 136 attributes in path expressions, 135 XPath wildcards for, 141 attributes (objects), sparse, 13 attributes (XML), 2-4. 
See also nodes constructing from relational data, 275-277 converting to/from XML elements, 345-346 elements versus, 15-19 extracting value of, 557 full names, 441, 443-444 index creation and, 459 indexing, 8 inserting, defining position for, 336-337 namespaces and, 440-441 optional, 13 renaming, 334-335 updating with stored procedures, 554-555 values, replacing, 327-328 automatic updates for text indexes, 574-576 axes in XPath, 157 B BACKUP PENDING status, 111 backward compatibility of XML Schema versions, 495-498 base table row storage. See inlining BEFORE triggers, 523 Bernoulli sampling, 419 best practices for XML performance, 428-435 between predicates, 431 in XML queries, 254-256 binary data as internally encoded, 618 binary data types, 606
Index binary SQL types, converting XML values to, 187-188 binding. See XML data binding BLOB data type, inserting XML documents 80 blobFromFile UDF, 81 blobsFromZipURL UDF, 81-82 blocking cursors, 435 BOM (Byte-Order Mark), 599 Boolean expressions, predicates versus, 146 Boolean functions in XQuery, 226 Boolean operators in full-text searches, 583-584 boost modifiers, 594 boundary whitespace, 90-91 preserving 91-93 bulk shredding of XML documents, 315-318 business data. See data business objects data representation of, 12-13 storage of, 612 Byte-Order Mark (BOM), 599 C C applications with embedded SQL, 645-647 Call Level Interface. See CLI application development cardinality of XML indexes, 363 Cartesian products, 240 case-insensitive XML queries, 252-253 cast expressions, 190 in XQuery, 208-212 castable XQuery expression, 211 casting. See converting 729 catalog tables (DB2 for z/OS), XML-related, 667-673 for XML indexes, 671-672 XML Schema Repository (XSR), 503-508, 672 for XML storage objects, 667-670 catalog views (DB2 for Linux, UNIX, and Windows), XML-related, 661-667 SYSCAT.COLUMNS, 661-662 SYSCAT.INDEXES, 663-664 SYSCAT.INDEXXMLPATTERNS, 664-666 SYSIBM.SYSXMLPATHS, 663 SYSIBM.SYSXMLSTRINGS, 662-663 XML Schema Repository (XSR), 503-508, 667 change requests, response time for, 613 character data, as externally encoded, 619 character data types, blocking usage of, 606 character encoding. 
See XML encoding character references, list of, 87 character type application variables, fetching non-Unicode data into, 603-604 check constraints, 8, 520-523 CHECK DATA utility, 69-70 CHECK INDEX utility, 66 CHECK PENDING status, 110 child axis, 157 Chinese characters in code page ISO-8859-1 (code page conversion example), 602-603 CLI (Call Level Interface) application development, 636-639 CLOB data type, inserting XML documents 80 CLP (Command Line Processor) DESCRIBE command, 84-85 escaping quotes in, 571 input parameters, text files as, 708 INSERT statements, 76-77 registering XML Schemas in, 484-486 retaining whitespace, 527 terminating characters, changing, 549 testing stored procedures, 555 truncated XML document display 83 viewing XML documents, 704-705 XML declarations, inserting 86 XML options list of, 706 usage examples, 706-707 coarse granularity of XML documents, 22 COBOL applications with embedded SQL, 640-642 code page conversions, 597 avoiding, 601 examples of, 602-605 with non-Unicode database code pages, 601-602 performance considerations, 434 code pages, selecting, 27 column references dollar sign ($) in, 161 in XMLQUERY function, 162-163 columns (XML) dropping, 40 generating from XML data, 165-166
730 inserting constructed XML data into, 294-295 comma (,) operator, construction of sequences, 154 Command Line Processor. See CLP commands for full-text searches, list of, 594-595 comment nodes, constructing, 290 common table expressions, 282-283 comparison expressions, 190 comparison operations in predicates, 143 comparison operators numeric versus string comparison, 144 in XPath, 156-157 compatibility. See backward compatibility compliance, data storage for, 94 components. See schema documents compression of XML data, 48-51 XML space management example, 54-57 computed values replacing values in XML documents with, 329-331 XML construction with, 202-204 concat function (XQuery), 215-216 concat( ) function, 155 concatenation of text nodes, 30 concurrency control, XML documents, 9 conditional expressions, 190 XML construction with, 205 conditional triggers, 524 conditional XML element construction, 284-285 leading zeros in, 285-286 Index configuring XML inlining, 43-47 constraints on XML documents, 8, 520-523 constructing XML data. See converting relational data to XML data; XML construction construction of sequences in XPath, 154-155 constructor expressions, 190 constructor functions. See publishing functions (SQL/XML) constructors (XQuery), 290-292 XML namespaces and, 462-463 contains function (XQuery), 216-217, 587 CONTAINS scalar function, 581-583 content-centric XML documents, 567 context (file system navigation), 133 context nodes, 136, 139 convert function, 229 converting. 
See also shredding relational data to XML data, 267 inserting in XML columns, 294-295 with SQL/XML publishing functions, 268-290 XML declarations for, 292-294 with XQuery constructors, 290-292 XML elements to/from XML attributes, 345-346 XML values to binary SQL types, 187-188 COPY TABLESPACE utility, 66 copying XML documents 86 COPYTOCOPY utility, 66 count( ) function, 155 Creat Index Wizard, 366-367 CREATE INDEX command (DB2 Text Search), 591-592 CREATE INDEX command (DB2 Net Search Extender), 572-579 advanced options, 578-579 with automatic updates, 574-576 for parts of documents, 576-577 with specific storage paths, 573-574 CREATE INDEX statement, 362-364 current context in XPath, 151-153 current directory (file system navigation), 133 CURRENT IMPLICIT XMLPARSE OPTION register 93 current-date function (XQuery), 225-226 current-dateTime function (XQuery), 225 current-time function (XQuery), 225 cursors loading from, 111 in stored procedures, 553-554 update cursors, modifying XML documents in, 350-351 custom document models, fulltext searches with, 585-586 custom XML Schemas, industry standard XML Schemas versus, 474-476 customer table (XML sample database), contents of, 710-712
Index D data, distinguishing from metadata, 19-21. See also relational data; XML data data binding (XML) to Java objects, 629 pureQuery and, 631 data exchange, metadata for, 13 data expansion/shrinkage (code page conversion example), 605 data function (XQuery), 221 data loss due to XML encoding, avoiding, 606 data models. See also design decisions XML data, when to use, 11-13 XQuery 1.0 and XPath 2.0 Data Model, 126-131 sequence construction, 128-130 sequence input/output, 130-131 data providers, list of, 631 data storage. See storage Data Studio, support for DB2 pureXML, 9 Data Studio Developer, 652-655 defining schema annotations, 311 profiling stored procedures, 556 data types (SQL) BLOB, inserting XML documents, 80 CLOB, inserting XML documents, 80 converting XML values to binary SQL types, 187-188 DESCRIBE command, 84-85 index eligibility and, 374-375 731 type errors, avoiding in XMLTABLE function, 168-169 XML, 7-9, 160 for XML indexes, 367-372 DATE, 369 DECFLOAT, 369 DOUBLE, 369 rejecting invalid values, 371-372 selecting, 369-371 TIMESTAMP, 369 VARCHAR HASHED, 368-369 VARCHAR(n), 367-368 in XQuery, 208-212 data types (Java), SQLXML, 9 data( ) function, 134-135 data-centric XML documents, 567 database code page, non-Unicode database usage, 601-602 database nodes. See partitioned databases Database Partitioning Feature (DPF), 59-60 database utilities, monitoring, 427-428 database-centric validation, 545 databases disabling for DB2 Net Search Extender, 572 for DB2 Text Search, 591 enabling for DB2 Net Search Extender, 571-572 for DB2 Text Search, 590-591 XML sample database. 
See XML sample database DatabaseSpy, 658 DataDirect, 659 date comparisons, string comparisons versus, 210-211 date functions in XQuery, 224-226 DATE index data type, 369 DB2 .NET Data Provider, 632 DB2 Control Center Creat Index Wizard, 366-367 support for DB2 pureXML, 9 viewing XML documents, 703-704 DB2 Express-C, 196 DB2 for Linux, UNIX, and Windows, xxvi explain facility, 396-409 exporting XML documents, 98-106 importing XML documents, 106-109 index implementation, 387-390 loading XML documents, 109-111 snapshot monitor, 424-427 statistics collection in, 418-419 validation in, DB2 for z/OS versus, 543-544 XML compression, 48 XML index data types, 367 XML index statistics, 390-393 XML sample database, creating, 709-710 XML Schemas in, 510-511 XML storage, 33-41 in DB2 9.7 release, 40-41 dropping XML columns, 40 storage objects, types of, 33-35 table space page size, 36-39
732 XML-related catalog views, 661-667 SYSCAT.COLUMNS, 661-662 SYSCAT.INDEXES, 663-664 SYSCAT.INDEXXMLPATTERNS, 664-666 SYSIBM.SYSXMLPATHS, 663 SYSIBM.SYSXMLSTRINGS, 662-663 XML Schema Repository (XSR), 667 DB2 9.1 for Linux, UNIX, and Windows, XML encoding, 597 DB2 9.5 for Linux, UNIX, and Windows, XML encoding, 597 DB2 9.7 for Linux, UNIX, and Windows, optimized XML storage format, 40-41 DB2 for z/OS, xxvi explain facility, 409-416 full-text searches in, 596 loading XML documents, 114-116 statistics collection in, 417-418 unloading XML documents, 111-114 updating XML documents in, 351-352 validation in, 540-544 DB2 for Linux, UNIX, and Windows versus, 543-544 for existing XML documents, 543 with INSERT statement, 541-542 with UPDATE statement, 542-543 XML compression, 48 XML encoding, 598 XML index data types, 367 Index XML sample database, creating, 710 XML Schemas in, 510-511 XML storage, 60-73 limiting memory consumption, 71 multiple XML columns, 64 naming conventions, 64-65 offloading XML parsing, 72-73 storage objects, types of, 61-62 table space characteristics, 63 utilities for, 65-70 XML-related catalog tables, 667-673 for XML indexes, 671-672 XML Schema Repository (XSR), 672 for XML storage objects, 667-670 DB2 Net Search Extender administration commands, list of, 594-595 altering text indexes, 580 creating text indexes, 572-579 DB2 Text Search versus, 568-570 disabling databases for, 572 enabling databases for, 571-572 performing full-text searches, 581-590 reorganizing text indexes, 579-580 updating text indexes, 579-580 DB2 pureXML. 
See pureXML DB2 Text Search, 590 administration commands, list of, 594-595 creating text indexes, 591-592 DB2 Net Search Extender versus, 568-570 disabling databases for, 591 enabling databases for, 590-591 performing full-text searches, 592-594 db2-fn:sqlquery function, 139, 166, 227, 229-230, 582 db2-fn:xmlcolumn( ) function, 137, 166 db2-fn:xmlcolumn-contains function, 592 db2cat utility, 419-423 db2exfmt command-line tool, 396-399 db2look utility, XML documents and, 122 db2move utility, XML documents and, 123 DB2Xml class (.NET), 632-633 DB2Xml object (JDBC 3.0), benefits of, 616 DECFLOAT index data type, 369 declarations (XML), 2, 599-600 in CLI applications, 638 for constructed XML data, 292-294 in embedded SQL applications, 639 handling documents with, 85-86 declaring namespaces, 4 XML, 439-441 in SQL/XML, 451 in XML indexes, 456-460 in XMLTABLE function, 452-453 in XQuery, 448-450 XSLT, 356 DECOMPOSE XML DOCUMENT command, 312
DECOMPOSE XML DOCUMENTS command, 317-318 decomposing. See shredding dedicated directories, exporting XML documents to, 102-104 default namespaces (XML), renaming nodes in, 467-468 default tagging of relational data, 286-289 default whitespace preservation option, changing, 93-94 default XML namespaces, 442-444 default XML Schemas, validation against with LOAD and IMPORT utilities, 532 defining XML indexes, 362-367 delete expression (XQuery), 333 DELETE operator (execution plans), 401 DELETE statement, 82-83 delete triggers, 563 deleting XML documents, 82-83 XML nodes, 333-334 delimited format files, 99 descendant axis, 157 descendant nodes, 141 descendant-or-self axis, 157 DESCRIBE command, 84-85 describing queries, 137 design decisions, XML documents, 15-25, 428-429 elements versus attributes, 15-19 granularity, 22-24 hybrid storage, 24-25 performance, role of, 16 tags versus values, 19-21 detaching partitions, 57 DFETCH operator (execution plans), 413 digital signatures, effect of stripping whitespace on, 78 direct element construction, 171 direct element/attribute constructors (XQuery), XML namespaces and, 462-463 direct XML construction, 202 directories, exporting XML documents to, 102-104 directoryInfo UDF, 81 disabling annotated schemas for shredding, 312 databases for DB2 Net Search Extender, 572 for DB2 Text Search, 591 distinct-values function (XQuery), 221 distribution keys, 60 document ID index, 61 document models, 576-577 custom document models, full-text searches with, 585-586 document nodes, 29, 129 constructing, 294-295 Document Object Model (DOM) parsers, 610 Document Object Model fidelity, 94 document trees (XML), 28-30 storage of, 30-33 Document Type Definitions (DTDs), 501-502 document validation. See validation document-centric XML documents.
See content-centric XML documents (XML) access control, 9 attribute values, replacing, 327-328 checking for validation, 534-535 constraints, 8 constructing from multiple relational rows, 277-280 from multiple relational tables, 281-283 content-centric versus data-centric, 567 copying, 86 creating from Java application data, 627-628 db2look utility and, 122 db2move utility and, 123 deleting, 82-83 description of, 2-4 design decisions, 15-25, 428-429 elements versus attributes, 15-19 granularity, 22-24 hybrid storage, 24-25 performance, role of, 16 tags versus values, 19-21 document trees, 28-30 storage of, 30-33 element values, replacing, 326-327 elements/attributes, renaming, 334-335 escaping special characters, 87-89 exporting, 98-106 to dedicated directories, 102-104 fragments of documents, 104-105 to multiple files, 100-102 to single file, 98-100 with XML Schema information, 105-106 federating, 120-121 importing, 106-109 input files and, 107-108 performance tips, 108-109
indexing, 8 inserting, 76-82 from files, 79-82 INSERT statement, 76-79 loading in DB2 for Linux, UNIX, and Windows, 109-111 in DB2 for z/OS, 114-116 modifying in insert operations, 349-350 in queries, 346-349 in update cursors, 350-351 with XQuery Update Facility, 324-326 namespace declarations, 439-441 namespace usage examples, 444-447 nodes deleting, 333-334 inserting, 335-340 modifying multiple, 343-346 repeating/missing, 340-343 replacing, 331-332 parameter markers, replacing values with, 328 parsing, 9 avoiding in application layer, 610-612 publishing, 118-119 queries on, 8-9 removing validation, 540 replacing, 322-324 multiple values in, 328-329 values with computed values, 329-331 replicating, 118-119 retaining invalid, 519-520 retrieving, 83-85, 161-165 shredding, 10 advantages/disadvantages of, 297-301 with annotated schema shredding, 306-318 with XMLTABLE function, 301-306 splitting, 116-118 storage. See XML storage transforming with XSLT, 352-358 traversing, 197 unloading, 111-114 updating, 433 in DB2 for z/OS, 351-352 with UDFs, 559-561 valid documents determining XML Schemas for, 538-540 well-formed documents versus, 473 validation. See validation viewing structure of, 703-705 well-formed, 76 whitespace, 89-94 changing default preservation option, 93-94 preserving, 91-93 types of, 90 with XML declarations, handling, 85-86 dollar sign ($) in XML column references, 161 XQuery variable names, 196 DOM (Document Object Model) parsers, 610 dot notation in XPath, 151-153 DOUBLE index data type, 369 double slash (//) in XPath, 141-142 in XPath predicates, 146 DPF (Database Partitioning Feature), 59-60 DROP XSROBJECT command, 492 dropping check constraints, 522 XML columns, 40 DSNTIAUL command, 111-112 DSN_XMLVALIDATE function, 541-543 DTDs (Document Type Definitions), 501-502 in .NET applications, handling, 636 registering, 501 dynamic XPath expressions, 185-186 E EAV (Entity-Attribute-Value model).
See Name/Value Pairs editing (Data Studio Developer) queries, 654 XML Schemas, 653 element constructors (XQuery), 290-292 element nodes, 29-30, 129 element values, returning without XML tags, 163-164 elements (XML), 2-4. See also nodes attributes versus, 15-19 constructing from relational data, 269-273 conditional construction, 284-286 empty, missing, NULL elements, 274-275 converting to/from XML attributes, 345-346 extracting repeating values, 557-558 extracting value of, 557 full names, 441-444
indexing, 8 inserting, defining position for, 335-336 leaf elements, 383 non-leaf elements, XML indexes on, 383-384 optional elements handling in XMLTABLE function, 167-168 schema flexibility of, 5 renaming, 334-335 repeating elements numbering rows based on, 173-174 returning multiple, 174-176 returning with XMLQUERY function, 164-165 returning with XMLTABLE function, 169-173 schema flexibility of, 5 root elements, 28 updating with stored procedures, 554-555 values replacing, 326-327 as text node concatenations, 30 XPath wildcards for, 140 embedded SQL application development, 639-647 C applications with, 645-647 COBOL applications with, 640-642 PL/1 applications with, 643-644 embedding SQL in XQuery, 227-228 empty elements (relational data), converting to XML data, 274-275 “Empty on NULL” behavior, 274 enabling annotated schemas for shredding, 312 databases for DB2 Net Search Extender, 571-572 for DB2 Text Search, 590-591 encoding (XML). See also Unicode code page conversions avoiding, 601 examples of, 602-605 code pages, selecting, 27 data loss, avoiding, 606 embedded SQL application development and, 639 external encoding, 599-601 internal encoding, 599-600 non-Unicode database usage, 601-602 overview, 597 encoding declaration, 599 enforcing validation with check constraints, 520-523 with triggers, 523-525 entities (XML), 87, 501 entity references, list of, 87 Entity-Attribute-Value model (EAV).
See Name/Value Pairs error codes explained, 258-264 SQL0104N, 500 SQL0242N, 277 SQL0401N, 186 SQL0443N, 81 SQL0544N, 521 SQL0545N, 521 SQL0551N, 500 SQL1354N, 548 SQL1407N, 111 SQL16001N, 259 SQL16002N, 146, 259-260, 605 SQL16003N, 156, 169-170, 210, 213, 249, 260-261 SQL16005N, 261-262 SQL16011N, 263 SQL16015N, 262-263 SQL16061N, 144, 169, 211, 263-264, 551 SQL16075N, 136, 264 SQL16085N, 336, 339, 341-342 SQL16088N, 467 SQL16103N, 601 SQL16110N, 87 SQL16168N, 600 SQL16168N, 85 SQL16193N, 440 SQL16196N, 517 SQL16267N, 318 SQL16271N, 318 SQL20329N, 491 SQL20335N, 514 SQL20340N, 491 SQL20345N, 294, 337 SQL20353N, 186 SQL20412N, 604 SQL20429N, 606 SQL20432N, 498 SQLCODE -904, 71 SQLCODE 16002, 705 SQLSTATE 2200M, 519 error handling for registered XML Schemas, 490-491 in stored procedures, 551-553 for validation/parsing errors, 525-529 escaping ampersand (&), 88 less-than character (<), 87 quotes (’), 77, 88, 571 special characters, 87-89 except operator, 155 exchanging data. See data exchange
executing stored procedures, 547 triggers, 547 UDFs, 547 execution plans, 395-396 obtaining with db2exfmt command-line tool, 397-399 with SPUFI, 410-411 with Visual Explain tool, 400-401, 411-413 operators, list of, 401-403, 413-414 of stored procedures, 555-556 usage examples, 403-409, 414-416 existential semantics, 241, 254, 377 logical expressions and, 149 in XPath, 147-148 existing XML documents, validating, 535-538 in DB2 for z/OS, 543 expanded names of XML elements/attributes, 441-444 explain facility in DB2 for Linux, UNIX, and Windows, 396-409 db2exfmt command-line tool, 397-399 execution plan operators, 401-403 explain tables, 396-397 usage examples, 403-409 Visual Explain tool, 400-401 in DB2 for z/OS, 409-416 execution plan operators, 413-414 explain tables, 409-410 SPUFI, 410-411 usage examples, 414-416 Visual Explain tool, 411-413 explain tables in DB2 for Linux, UNIX, and Windows, 396-397 in DB2 for z/OS, 409-410 EXPLAIN utility, 9 explaining stored procedure statements, 555-556 explicit serialization, 83, 294 EXPORT command, 98-106 exporting XML documents, 98-106 to dedicated directories, 102-104 fragments of documents, 104-105 to multiple files, 100-102 to single file, 98-100 with XML Schema information, 105-106 extensibility in design decisions, 17 of XML, 1 eXtensible Markup Language. See XML eXtensible Stylesheet Language Transformation. See XSLT eXtensible Stylesheet Language. See XSL external DTDs, 501 external encoding of character data, 619 external XML encoding, 599-601 extracting repeating XML element values, 557-558 XML element/attribute values, 557 F -f CLP option, 708 federating XML documents, 120-121 FETCH operator (execution plans), 401 file paths.
See paths file system navigation, 133 files, inserting XML documents from, 79-82 FILTER operator (execution plans), 401 filtering conditions on XMLQUERY function, 587 fine granularity of XML documents, 23 flexibility in design decisions, 17 of XML Schema, 5-6 FLWOR expressions, 190-196 comparing with XPath and SQL/XML, 196-202 for and let clauses, compared, 193-194 for and let clauses, nested, 195-196 handling repeating/missing XML nodes, 342 join queries in, 247 in SQL/XML, 201-202 syntax of, 191-193 where and order by clauses, 194 for clause (FLWOR expressions) let clause versus, 193-194 nested, 195-196 fragments of XML documents, exporting, 104-105 full names of XML elements/attributes, 441, 443-444 full-text searches DB2 for z/OS, 596 DB2 Net Search Extender administration commands, list of, 594-595
altering text indexes, 580 creating text indexes, 572-579 DB2 Text Search versus, 568-570 disabling databases for, 572 enabling databases for, 571-572 performing searches, 581-590 reorganizing text indexes, 579-580 updating text indexes, 579-580 DB2 Text Search, 590 administration commands, list of, 594-595 creating text indexes, 591-592 disabling databases for, 591 enabling databases for, 590-591 performing searches, 592-594 sample table for examples, 570-571 fullselect (SQL), 555 functions XPath, 155 XQuery, 214-226 Boolean functions, 226 date and time functions, 224-226 namespace and node functions, 222-224 numeric and aggregation functions, 218-220 sequence functions, 220-222 string functions, 215-218 fuzzy searches, 586-587 G general comparison operators in XPath, 156 generated column, 557 GENROW operator (execution plans), 402 GET SNAPSHOT command, 425 global declarations in XML Schemas, 478 global indexes, 58 global sequences, performance optimization, 256-257 GRANT command, 499 granting XML Schema usage privileges, 499-500 granularity of XML documents, 22-24, 428, 433 grouping XML data, 233-239. See also aggregation in SQL/XML versus XQuery, 237-239 XMLTABLE function, 234-236 GUI for defining SQL/XML publishing functions, 289-290 H HADR (High Availability Disaster Recovery), 121 hashed indexes, 368 help. See technical support hierarchical data, 12 hierarchical format, XML document trees, 28-30 High Availability Disaster Recovery (HADR), 121 host variables, 183-184, 613-614 INSERT statements, 78 performance considerations, 434 HSJOIN operator (execution plans), 402 HTML. See XML to HTML transformation hybrid storage, 24-25, 299, 303-305 with stored procedures, 550-553 I IBM Data Server Driver for JDBC and SQLJ.
See JCC IBM Data Studio Developer, 652-655 IBM Database Add-ins for Visual Studio, 656 IBM OmniFind Text Search Server for DB2 for z/OS, 596 IBM pureXML Technical Mastery Test, 675 ibm_db2 PHP extension, 647 identifiers for XML Schemas, 483, 516 ignoring stop words, 578 implicit parsing, 516 implicit serialization, 83, 294 implicit XML parsing, 354 IMPORT command, 106-109 input files and, 107-108 LOAD command versus, 106 performance tips, 108-109 triggers and, 573 validating XML documents, 116, 530-534 against default XML Schemas, 532 against multiple XML Schemas, 530-532 against single XML Schema, 530-531 overriding XML Schema references, 532-534 schema location hints, 534
importing schema documents in XML Schemas, 479-482 XML documents, 106-109 input files and, 107-108 performance tips, 108-109 in-scope namespaces, 445, 455 in-scope-prefixes function, 445 including schema documents in XML Schemas, 479-482 index directories, locating with work directories, 574 index eligibility, 373-374 data types and, 374-375 parent steps and, 385-386 text nodes and, 375-376 wildcards and, 376-377 XML namespaces and, 458-459 XMLQUERY and, 385 XQuery let and return clauses, 386-387 indexes catalog tables for, 671-672 logical, 664-666 path indexes, 663 physical, 664-666 on range-partitioned tables, 58 regions indexes, 663 reorganization, 54 text indexes (DB2 Net Search Extender) altering, 580 creating, 572-579 reorganizing, 579-580 updating, 579-580 user-defined XML, 664-666 on XML documents, 8 indexes (XML) best practices, 432-433 cardinality of, 363 creating with DB2 Control Center, 366-367 with XML namespaces, 456-460 data types for, 367-372 DATE, 369 DECFLOAT, 369 DOUBLE, 369 rejecting invalid values, 371-372 selecting, 369-371 TIMESTAMP, 369 VARCHAR HASHED, 368-369 VARCHAR(n), 367-368 DB2 for Linux, UNIX, and Windows implementation, 387-390 defining, 362-367 explain facility.
See explain facility join predicates and, 379-383 lean indexes, 365 logical and physical indexes, 389-390 on non-leaf elements, 383-384 parent steps and, 385-386 path indexes for, 387-389 query predicates and, 373-379 relational indexes versus, 361 statistics, 390-393 for structural predicates, 377-379 unique indexes, 364-365 in XMLQUERY, 385 XQuery let and return clauses, 386-387 industry standard XML Schemas, custom XML Schemas versus, 474-476 InfoSphere Data Architect, 289-290 InfoSphere Federation Server, 120 inlining, 41-48, 429-430 benefits of, 47-48 drawbacks of, 48 monitoring and configuring, 43-47 viewing percentage of, 661-662 XML space management example, 54-57 input, sequences as, 130-131 input files, IMPORT command and, 107-108 input parameters (CLP), text files as, 708 input parameters (XML) in stored procedures, 548 insert operations, modifying XML documents in, 349-350 INSERT statement, 76-79 copying XML documents, 86 preserving whitespace, 92-93 validation, 514-517 in DB2 for z/OS, 541-542 XMLTABLE function, shredding XML documents with, 301-306 insert triggers, 562-563 inserting constructed XML data into XML columns, 294-295 nodes in XML documents with namespaces, 468-469 XML data from .NET applications, 635 XML documents, 76-82 from files, 79-82 INSERT statement, 76-79 XML nodes, 335-340 insignificant whitespace, 90 instances of the data model, 128
integer division in XQuery, 214 integration, resources for information, 726 internal DTDs, 501 internal encoding of binary data, 618 XML encoding, 599-600 intersect operator, 155 INTERSECT operator (execution plans), 413 invalid XML documents, retaining, 519-520 invalid XML index data type values, rejecting, 371-372 ISO-8859-1, Chinese characters in (code page conversion example), 602-603 items (XQuery Data Model), 129 J Japanese literal values in non-Unicode database (code page conversion example), 605 Java application development, 615-631 JDBC 3.0, XML support in, 615-619 JDBC 4.0, 9 example usage, 621-627 XML support in, 619-621 pureQuery, 629-631 XML data binding, 629 XML documents, creating from application data, 627-628 Java applications inserting XML documents from, 78-79 registering XML Schemas from, 488 JCC (Java Common Client), 615 JDBC registering XML Schemas with, 488 support for, 615 JDBC 3.0, XML support in, 615-619 JDBC 4.0, 9 example usage, 621-627 XML support in, 619-621 join predicates, XML indexes and, 379-383 join queries, 239 outer joins, 250-252 in SQL/XML, 242-247 XML-to-relational joins, 248-250 in XQuery, 240-242 joins best practices, 431 XML versus relational data, 7, 241 K key cardinalities in XML indexes, 390 Key-Value Pairs (KVP). See Name/Value Pairs known whitespace, 90 Korean character code page conversion example, 605 KVP (Key-Value Pairs). See Name/Value Pairs L last function (XQuery), 222 last( ) function, 153 leading zeros in conditional XML element construction, 285-286 leaf elements, 383 lean XML indexes, 365 left outer joins, 250 legacy functions (SQL/XML), 290 less-than character (<), escaping, 87 let clause (FLWOR expressions) for clause versus, 193-194 nested, 195-196 let clause (XQuery), index eligibility and, 386-387 Linux.
See DB2 for Linux, UNIX, and Windows list tablespaces command, 51 LIST UTILITIES command, 427 LISTDEF utility, 69 LOAD command, 109-111, 114-116 IMPORT command versus, 106 triggers and, 573 validating XML documents, 116, 530-534 against default XML Schemas, 532 against multiple XML Schemas, 530-532 against single XML Schema, 530-531 overriding XML Schema references, 532-534 schema location hints, 534 LOAD QUERY command, 428 loading XML documents in DB2 for Linux, UNIX, and Windows, 109-111 in DB2 for z/OS, 114-116 LOB storage pureXML storage versus, 10-11 for XML data, 10 local declarations in XML Schemas, 478 local indexes, 58 local names of XML elements/attributes, 441-444 local-name function (XQuery), 223
locale-aware Unicode collations, 252 locators, 577 locking XML documents, 9 logical expressions, 190 in XPath, 148-151 logical indexes, 664-666 XML indexes, 389-390 loops in stored procedures, 553-554 M manipulating XML data. See XML manipulation MapForce, 657 mapping path indexes for XML indexes, 387-389 paths to pathIDs, 663 relational data to XML data, GUI-based definition, 289-290 tag names to stringIDs, 31-33 XML data to relational data. See annotated schema shredding XML Schema pairs, 533 XML tags to stringIDs, 662 marshalling, 629 MDC (multidimensional clustering), 58-59 medium granularity of XML documents, 22 memory consumption, limiting in DB2 for z/OS, 71 metadata distinguishing from data, 19-21 for data exchange, 13 missing elements (relational data), converting to XML data, 274-275. See also optional elements missing XML nodes, handling, 340-343 mixed content, 143 in XML document trees, 29-30 modifying. See also updating multiple XML nodes, 343-346 XML documents in insert operations, 349-350 in queries, 346-349 in update cursors, 350-351 with XQuery Update Facility, 324-326 monitoring performance, 424 of database utilities, 427-428 with snapshot monitor, 424-427 XML inlining, 43-47 moving.
See exporting; importing; inserting; loading; unloading multidimensional clustering (MDC), 58-59 multiple documents, constructing from queries, 253-254 multiple files, exporting XML documents to, 100-102 multiple for/let clauses (FLWOR expressions), 195-196 multiple namespaces in XML documents, 440-441 multiple nesting levels, XML construction with, 206-207 multiple node values in XML documents, replacing, 328-329 multiple relational rows, constructing XML documents from, 277-280 multiple relational tables, constructing XML documents from, 281-283 multiple repeating elements, returning, 174-176 multiple schema documents in XML Schemas, 479-482 multiple table spaces, performance and, 37 multiple XML columns in DB2 for z/OS, 64 populating, 108 multiple XML documents, shredding, 315-318 multiple XML namespaces, querying XML documents with, 454-456 multiple XML nodes, modifying, 343-346 multiple XML Schemas, validation with LOAD and IMPORT utilities, 530-532 with triggers, 524 N Name/Value Pairs (NVP), 20-21 namespace functions in XQuery, 222-224 namespaces (XML), 437-439 constructing XML data with, 460-463 creating indexes with, 456-460 declaring, 4, 439-441 for XSLT, 356 default, 442-444 full-text searches and, 588-590 querying XML data with, 447-456 updating XML data with, 463-469 usage examples, 444-447
XML indexes and, 432 in XML sample database tables, 710 naming conventions XML storage in DB2 for z/OS, 64-65 XML tags, 4 nested for/let clauses (FLWOR expressions), 195-196 nested predicates, 150 nested XQuery functions, 217 nesting SQL and XQuery, 257-258 XML tags, 3 XMLELEMENT functions, 270-273 nesting levels, XML construction with, 206-207 .NET application development, 631-636 ADO.NET data providers, list of, 631 inserting XML data from, 635 manipulating XML data in, 633-635 querying XML data in, 632-633 XML Schema and DTD handling, 636 Net Search Extender. See DB2 Net Search Extender node functions in XQuery, 222-224 node tests, 133 NodeID index, 62 nodes. See also partitioned databases attribute nodes, attribute values versus, 136 context nodes, 136, 139 descendant nodes, 141 document nodes, constructing, 294-295 inserting/replacing in XML documents with namespaces, 468-469 renaming in XML documents with default namespaces, 467-468 in XML documents with prefixed namespaces, 465-467 text nodes, index eligibility and, 375-376 types of, 28 values, replacing with computed values, 329-331 multiple values, 328-329 with parameter markers, 328 in XML documents deleting, 333-334 inserting, 335-340 modifying multiple, 343-346 repeating/missing, 340-343 replacing, 331-332 XQuery Data Model, 129 non-leaf elements, 30, 134 XML indexes on, 383-384 non-Unicode databases avoiding data loss in, 606 for XML data management, 601-602 normalization, 7 of business objects, 12 not equal (!=) comparison operator, not( ) function versus, 150 NOT operator in full-text searches, 584 not( ) function, 148, 150 not equal (!=) comparison operator versus, 150 NSE.
See DB2 Net Search Extender NULL, setting XML columns to, 82 NULL elements (relational data), converting to XML data, 274-275 “NULL on NULL” behavior, 274 numbering rows based on repeating elements, 173-174 NUMBEROFMATCHES scalar function, 581-583 numeric comparisons, string comparisons versus, 144, 211-212 numeric functions in XQuery, 218-220 NVP (Name/Value Pairs), 20-21 O octets, 188 offloading XML parsing in DB2 for z/OS, 72-73 OmniFind Text Search Server for DB2 for z/OS, 596 one-to-many relationships, XML elements, 3 online table moves, 40 operators (for execution plans), 395 list of, 401-403, 413-414 usage examples, 403-409, 414-416 optimization of queries, 253-258 between predicates, 254-256 large global sequences, 256-257 nesting SQL and XQuery, 257-258 single versus multiple document construction, 253-254 optional attributes (XML), 13
optional elements (XML) handling in XMLTABLE function, 167-168 schema flexibility of, 5 OR operator, 149-150 in full-text searches, 583-584 order by clause (FLWOR expressions), 194 ordering result sets by XML values, 186-187 outer joins, 250-252 output, sequences as, 130-131 overriding XML Schema references in LOAD and IMPORT utilities, 532-534 <oXygen/>, 658-659 P page size of table spaces, 36-39 for XML storage, 429 page-level sampling, 419 pairs (XML Schemas), mapping, 533 parameter markers, 183-184, 613-614 INSERT statements, 78 performance considerations, 434 replacing values with, 328 parent axis, 157 parent of current directory (file system navigation), 133 parent steps index eligibility and, 385-386 in XPath, 151-153 parsing, 30 avoiding in application layer, 610-612 error handling, 525-529 implicit parsing, 516 pureQuery and, 631 valid versus well-formed XML documents, 473 XML documents, 9 offloading in DB2 for z/OS, 72-73 performance considerations, 434 with special characters, 88 partial shredding, 299 partition elimination, 57 PARTITION operator (execution plans), 413 partitioned databases, 59-60 partitioning, range, 57-58 path expressions, 190 path indexes, 35, 58, 663 for XML indexes, 387-389 pathIDs, mapping to paths, 663 paths in IMPORT command, 107 mapping to pathIDs, 663 storage paths for text indexes, 573-574 pdo_ibm PHP extension, 647 percent sign (%) in wildcard searches, 583 performance best practices, 428-435 explain facility in DB2 for Linux, UNIX, and Windows, 396-409 in DB2 for z/OS, 409-416 importing XML documents, 108-109 LOAD command, 110 mapping tag names to stringIDs, 32 monitoring, 424 of database utilities, 427-428 with snapshot monitor, 424-427 multiple table spaces and, 37 partition elimination, 57 query optimization, 253-258 between predicates, 254-256 large global sequences, 256-257 nesting SQL and XQuery, 257-258 single versus multiple document construction, 253-254 role in design decisions, 16 text indexes and, 574 of
XSLT processing, 353 Perl application development, 650-651 PHP application development, 647-649 physical indexes, 664-666 physical XML indexes, 389-390 pipe (|) character union of sequences, 154 as XPath union operator, 585 PL/1 applications with embedded SQL, 643-644 plain SQL (XML data queries), 127 position( ) function, 154 positional predicates in XPath, 153-154 positional relationships in search conditions, 588 positioning inserted XML attributes, 336-337 inserted XML elements, 335-336 predicates in FLWOR expressions, 192 join predicates, XML indexes and, 379-383 query examples of, 198-199 query predicates, XML indexes and, 373-379 structural predicates, XML indexes for, 377-379
usage with SQL/XML, 177-181 common mistakes, 181-182 XML construction with, 204-205 in XPath, 142-146 dot notation, 151-153 existential semantics, 147-148 logical expressions, 148-151 positional predicates, 153-154 prefixed namespaces (XML), 438-439 mixing with default XML namespaces, 442 renaming nodes in, 465-467 PreparedStatement interface (JDBC 3.0), 618 preserving whitespace, 91-93 changing default, 93-94 during import, 108 validation and, 517 pretty print, CLP option for, 707 primary schema documents, 481 privileges for XML Schema usage, granting/revoking, 499-500 processing instruction nodes, constructing, 290 product table (XML sample database), contents of, 712-713 profiling stored procedures, 556 prototyping, XML flexibility for, 612-613 proximity searches, 586 publishing functions (SQL/XML), 160, 268-290 combining with XQuery constructors, 292 empty, missing, NULL elements, 274-275 GUI-based definition, 289-290 legacy functions, 290 list of, 268 XML namespaces and, 460-462 XMLAGG, 277-283 XMLAGG, XMLCONCAT, XMLFOREST compared, 284 XMLATTRIBUTES, 275-277 XMLCOMMENT, 290 XMLCONCAT, 270 XMLELEMENT, 269-273 XMLFOREST, 272-273 XMLGROUP, 286-289 XMLPI, 290 XMLROW, 286-289 XMLTEXT, 290 publishing XML documents, 118-119 purchaseorder table (XML sample database), contents of, 713-714 pureQuery, 629-631 pureXML for application development, benefits of, 610-613 functionality of, xxiii-xxiv, 7-10 quiz, 675-702 XML data storage methods versus, 10-11 Q Q Apply, 119 -q CLP option, 527, 706 Q replication, 119 queries.
See also querying XML data against XSR (XML Schema Repository), 508-510 editing in Data Studio Developer, 654 query predicates, XML indexes and, 373-379 querying XML data, 8-9 best practices, 430-432 case-insensitive queries, 252-253 error codes, 258-264 execution plans, 395-396 explain facility in DB2 for Linux, UNIX, and Windows, 396-409 in DB2 for z/OS, 409-416 grouping and aggregation, 233-239 in SQL/XML versus XQuery, 237-239 within and across documents, 236-237 XMLTABLE function, 234-236 join queries, 239 in SQL/XML, 242-247 in XQuery, 240-242 outer joins, 250-252 XML-to-relational joins, 248-250 methods of, 126-127 in .NET applications, 632-633 overview, 126-128 performance optimization, 253-258 between predicates, 254-256 large global sequences, 256-257 nesting SQL and XQuery, 257-258 single versus multiple document construction, 253-254
sample data for examples, 131-132 SQL/XML, 159-160 converting XML values to binary SQL types, 187-188 dynamic XPath expressions, 185-186 host variables, 183-184 namespace declarations, 451 ordering result sets, 186-187 overview, 160 parameter markers, 183-184 performance considerations, 434 retrieving XML documents, 161-165 retrieving XML values in relational format, 165-176 XPath predicate usage, 177-182 with XML namespaces, 447-456 XPath axes, 157 comparison operators, 156-157 construction of sequences, 154-155 data( ) function, 134-135 dot notation, 151-153 double slash (//), 141-142 empty results, reasons for, 134 executing in DB2, 137-140 existential semantics, 147-148 file system navigation analogy, 133 functions, 155 logical expressions, 148-151 node tests, 133 positional predicates, 153-154 predicates, 142-146 simple query examples, 133-136 slash (/), 141 string( ) function, 135 text( ) node test, 134 unabbreviated syntax, 157 union of sequences, 154-155 wildcards, 140-141 XQuery arithmetic expressions, 212-214 attribute expressions in XML construction, 206 comparing FLWOR expressions, XPath, SQL/XML, 196-202 computed value XML construction, 202-204 conditional expressions in XML construction, 205 data types, cast expressions, type errors, 208-212 direct XML construction, 202 embedding SQL in, 227-228 FLWOR expressions, 191-196 functions, 214-226 modifying XML documents in, 346-349 multiple nesting levels in XML construction, 206-207 namespace and node functions, 445 namespace declarations, 448-450 overview, 190 predicates in XML construction, 204-205 SQL functions and UDFs in, 229-230 XML aggregation in XML construction, 207-208 XQuery Data Model, 128-131 sequence construction, 128-130 sequence input/output, 130-131 question mark (?) as wildcard character, 594 questions.
See technical support quiz on pureXML, 675-702 quotes (’), escaping, 77, 88, 571 R range partitioning, 57-58 rapid prototyping, 612-613 RDA (Rational Data Architect), 289 REAL TIME STATISTICS utility, 66 REC2XML function (SQL/XML), 290 RECOVER INDEX utility, 66 RECOVER TABLESPACE utility, 66 referencing XML columns. See XML column references XML Schemas, 484 referential integrity of XML documents, 8 regions, 34-35 page size and, 36
regions indexes, 34, 58, 663 REGISTER XMLSCHEMA command, 311, 484 registering annotated schemas, 311-312 DTDs, 501 XML Schemas, 483-491 in CLP (command-line processor), 484-486 error handling for, 490-491 identifiers, 483 with JDBC, 488 with shared schema documents, 489-490 steps in, 483 with stored procedures, 486-487 relational data converting to XML data, 267 inserting in XML columns, 294-295 with SQL/XML publishing functions, 268-290 XML declarations for, 292-294 with XQuery constructors, 290-292 converting XML documents to advantages/disadvantages, 297-301 with annotated schema shredding, 306-318 with XMLTABLE function, 301-306 generating Java classes from, 629-631 hybrid storage, 24-25 XML versus, 4-7 when to use XML data, 11-13 XML-to-relational joins, 248-250 relational format, retrieving XML values in, 165-176 relational indexes, XML indexes versus, 361 relational joins, XML joins versus, 241 relational views over XML data, 305-306 relationships, one-to-many, 3 removing. See also deleting; stripping validation from XML documents, 540 XML Schemas from XSR, 492-493 renaming nodes in XML documents with default namespaces, 467-468 in XML documents with prefixed namespaces, 465-467 XML elements/attributes, 334-335 REORG command, 53-54, 68-69 reorganizing text indexes with DB2 Net Search Extender, 579-580 XML indexes, 433 XML space management example, 54-57 XML table data, 53-54, 68-69 repeating elements (XML), 3 extracting values of, 557-558 numbering rows based on, 173-174 returning multiple elements, 174-176 with XMLQUERY function, 164-165 with XMLTABLE function, 169-173 schema flexibility of, 5 repeating XML nodes, handling, 340-343 replace expression (XQuery), 331 replacing.
See also updating nodes in XML documents with namespaces, 468-469 XML attribute values, 327-328 XML documents, 322-324 XML element values, 326-327 XML node values with computed values, 329-331 multiple node values, 328-329 with parameter markers, 328 XML nodes, 331-332 replicating XML documents, 118-119 REPORT TABLESPACESET utility, 67-68 reserved characters. See special characters RESET MONITOR command, 425 resources for information, 717-726 on Altova XML tools, 658 response time for change requests, 613 result set cardinalities, 200-201 result sets, ordering by XML values, 186-187 ResultSet interface (JDBC 3.0), 615 retaining invalid XML documents, 519-520 whitespace in CLP, 527
retrieving XML documents, 83-85, 161-165 XML values in relational format, 165-176 return clause (XQuery), index eligibility and, 386-387 RETURN operator (execution plans), 402 returning element values without XML tags, 163-164 revised XML Schemas. See XML Schema evolution REVOKE command, 500 revoking XML Schema usage privileges, 499-500 RIDSCN operator (execution plans), 402 right outer joins, 251 root elements (XML), 4, 28 row-level sampling, 419 rows generating from XML data, 165-166 numbering based on repeating elements, 173-174 RPD operator (execution plans), 402 RUNSTATS INDEX utility, 67 RUNSTATS TABLESPACE utility, 67 RUNSTATS utility, 9, 50, 417, 666 in DB2 for Linux, UNIX, and Windows, 418-419 in DB2 for z/OS, 417-418 S sample database. See XML sample database sampling in statistics collection, 419 SAX (Simple API for XML) parsers, 611 scalar functions, 162, 200, 557 for full-text searches, 581-583 scalar subselects, 282 schema documents, 476 multiple schema documents in XML Schemas, 479-482 sharing between XML Schemas, 489-490 schema location hints in LOAD and IMPORT utilities, 534 schema names, comparison with XML namespaces, 438 schema validation. See validation schemas (XML). See also XML Schema best practices, 434 volatility of, 12 SCORE scalar function, 581-583 search conditions. See also predicates parts of, 582 positional relationships in, 588 search term (in search conditions), 582 searches.
See full-text searches section (in search conditions), 582 SELECT statement, retrieving XML documents, 83-85 selecting code pages, 27 XML index data types, 369-371 self axis, 157 self-describing data format, XML as, 19 self-joins, 228 semicolon (;) in namespace declarations, 448 in stored procedures, 549 sequence constructors, 175 sequence expressions, 190 sequence functions in XQuery, 220-222 sequences, 550 constructing, 128-130 global sequences, performance optimization, 256-257 as input/output, 130-131 in XPath, 154-155 serialization, 30, 83, 138 SET INTEGRITY command, 110 SET INTEGRITY PENDING status, 110 sharing schema documents between XML Schemas, 489-490 SHIP operator (execution plans), 402 shredding pureXML storage versus, 10-11 XML data with UDFs, 558-559 XML documents, 10 advantages/disadvantages of, 297-301 with annotated schema shredding, 306-318 with XMLTABLE function, 301-306 sibling branches, search conditions on, 588 significant whitespace, 90 Simple API for XML (SAX) parsers, 611 SimpleXML PHP extension, 647 single documents, constructing from queries, 253-254 single quotes (’), escaping, 77, 88, 571 size. See granularity
slash (/) in file system navigation, 133 in XPath, 141 in XPath predicates, 145 snapshot monitor, 424-427 snapshot semantics, 343-345 SNAPTAB_REORG administrative view, 428 SNAPUTIL administrative view, 427 SNAPUTIL_PROGRESS administrative view, 427 SORT operator (execution plans), 402 sparse attributes, 13 special characters, escaping, 87-89 splitting XML documents, 116-118 SPUFI execution plans, obtaining, 410-411 viewing XML documents, 705 SQL. See also SQL/XML embedding in XQuery, 127, 227-228 nesting with XQuery, 257-258 scalar functions for full-text searches, 581-583 stored procedures. See stored procedures for XML data queries, 127 SQL functions in XQuery, 229-230 SQL statements, embedding XPath/XQuery in, 127 SQL/XML, 8, 159-160 comparing with FLWOR expressions and XPath, 196-202 converting XML values to binary SQL types, 187-188 dynamic XPath expressions, 185-186 FLWOR expressions in, 201-202 grouping queries in, XQuery versus, 237-239 host variables, 183-184 INSERT statement, validation on, 514-517 join queries, XML-to-XML joins, 242-247 namespace declarations, 451 ordering result sets, 186-187 overview, 160 parameter markers, 183-184 performance considerations, 434 publishing functions, 268-290 combining with XQuery constructors, 292 empty, missing, NULL elements, 274-275 GUI-based definition, 289-290 legacy functions, 290 list of, 268 XML namespaces and, 460-462 XMLAGG, 277-283 XMLAGG, XMLCONCAT, XMLFOREST compared, 284 XMLATTRIBUTES, 275-277 XMLCOMMENT, 290 XMLCONCAT, 270 XMLELEMENT, 269-273 XMLFOREST, 272-273 XMLGROUP, 286-289 XMLPI, 290 XMLROW, 286-289 XMLTEXT, 290 UPDATE statement, validation on, 518-519 XML aggregation, XML construction with, 207-208 XML documents, retrieving, 161-165 XML values, retrieving in relational format, 165-176 XPath and XQuery versus, 201 XPath predicate usage, 177-181 common mistakes, 181-182 SQL0104N error code, 500 SQL0242N error code, 277 SQL0401N error code, 186 SQL0443N error code, 81 SQL0544N error code,
521 SQL0545N error code, 521 SQL0551N error code, 500 SQL1354N error code, 548 SQL1407N error code, 111 SQL16001N error code, 259 SQL16002N error code, 146, 259-260, 605 SQL16003N error code, 156, 169-170, 210, 213, 249, 260-261 SQL16005N error code, 261-262 SQL16011N error code, 263 SQL16015N error code, 262-263 SQL16061N error code, 144, 169, 211, 263-264, 551 SQL16075N error code, 136, 264 SQL16085N error code, 336, 339, 341-342 SQL16088N error code, 467 SQL16103N error code, 601 SQL16110N error code, 87 SQL16168N error code, 600 SQL16168N error code, 85
SQL16193N error code, 440 SQL16196N error code, 517 SQL16267N error code, 318 SQL16271N error code, 318 SQL20329N error code, 491 SQL20335N error code, 514 SQL20340N error code, 491 SQL20345N error code, 294, 337 SQL20353N error code, 186 SQL20412N error code, 604 SQL20429N error code, 606 SQL20432N error code, 498 SQLCODE -904 error code, 71 SQLCODE 16002 error code, 705 SQLSTATE 2200M error code, 519 SQLXML interface (JDBC 4.0), 619-621 SQLXML Java data type, 9 star. See asterisk (*) starts-with function (XQuery), 218 statement heap, size of, 432 statistics db2cat utility, 419-423 RUNSTATS utility, 417 in DB2 for Linux, UNIX, and Windows, 418-419 in DB2 for z/OS, 417-418 for XML indexes, 390-393 StAX (Streaming API for XML) parsers, 611 stemming in full-text searches, 586 steps (file system navigation), 133 stop words, ignoring, 578 storage. See also data storage of business objects, 612 for compliance, 94 hybrid XML data storage with stored procedures, 550-553 pureXML versus alternative XML storage methods, 10-11 of XML document trees, 30-33 XML storage, 429-430 in DB2 for Linux, UNIX, and Windows, 33-41 in DB2 for z/OS, 60-73 inlining, 41-48 MDC (multidimensional clustering), 58-59 partitioned databases, 59-60 range partitioning, 57-58 space consumption of, 51-53 space management example, 54-57 storage objects catalog tables for, 667-670 in DB2 for Linux, UNIX, and Windows, types of, 33-35 in DB2 for z/OS, types of, 61-62 storage paths for text indexes, 573-574 stored procedures, 548-556 benefits of, 547 for dynamic XPath expressions, 185-186 executing, 547 for hybrid XML data storage, 550-553 loops and cursors, 553-554 registering XML Schemas with, 486-487 retaining invalid XML documents, 519-520 for shredding XML documents, 313-315 testing, 555-556 updating XML elements/attributes, 554-555 Streaming API for XML (StAX) parsers, 611 string comparisons case-insensitivity, 252-253 date comparisons versus, 210-211 numeric comparisons versus, 144, 211-212
string functions in XQuery, 215-218 string( ) function, 135 string-join function, 171, 216 stringIDs, mapping tag names to, 31-33 to XML tags, 662 stripping whitespace, 78 changing default, 93-94 structural predicates, 147, 153 XML indexes for, 377-379 structure of XML documents, viewing, 703-705 style sheets. See XSLT StyleVision, 657 Stylus Studio, 659 subselects, 282 substring-after function (XQuery), 217 synchronous index maintenance, 361 syntax of FLWOR expressions, 191-193 SYSCAT.COLUMNS catalog view, 661-662 SYSCAT.INDEXES catalog view, 663-664 SYSCAT.INDEXXMLPATTERNS catalog view, 664-666 SYSCAT.XDBMAPGRAPHS catalog view, 504, 508, 667 SYSCAT.XDBMAPSHREDTREES catalog view, 504, 508, 667 SYSCAT.XSROBJECTAUTH catalog view, 504, 507, 667
SYSCAT.XSROBJECTCOMPONENTS catalog view, 504, 506, 667 SYSCAT.XSROBJECTDEP catalog view, 504, 507, 667 SYSCAT.XSROBJECTHIERARCHIES catalog view, 504, 506, 667 SYSCAT.XSROBJECTS catalog view, 503, 505, 667 SYSIBM.SYSINDEXES catalog table, 671 SYSIBM.SYSKEYTARGETS catalog table, 671-672 SYSIBM.SYSTABLES catalog table, 668-670 SYSIBM.SYSTABLESPACE catalog table, 670 SYSIBM.SYSXMLPATHS catalog view, 663 SYSIBM.SYSXMLRELS catalog table, 667-668 SYSIBM.SYSXMLSTRINGS catalog table, 668 SYSIBM.SYSXMLSTRINGS catalog view, 662-663 SYSIBMTS.TSCOLLECTIONNAMES table, 591 SYSIBMTS.TSCONFIGURATION table, 591 SYSIBMTS.TSDEFAULTS table, 591 SYSIBMTS.TSINDEXES table, 591 SYSIBMTS.TSLOCKS table, 591 SYSIN cards, unloading large XML documents, 113 SYSSTAT.INDEXES catalog view, 391 system sampling, 419 System z Application Assist Processors (zAAP), 71-72 System z Integrated Information Processors (zIIP), 71-72 T table functions, 200, 557 table partitioning. See range partitioning table spaces characteristics in DB2 for z/OS, 63 defined, 34 page size, 36-39 XML storage, 51-53, 429 tables. See also catalog tables online table moves, 40 reorganizing XML data, 53-54, 68-69 XML columns, dropping, 40 in XML sample database in DB2 for Linux, UNIX, and Windows, 710 in DB2 for z/OS, 710 tags (XML), 1-4 mapping to stringIDs, 31-33, 662 returning element values without, 163-164 values versus, 19-21 target namespaces, 438 for XML Schemas, 476 TBSCAN operator (execution plans), 402 technical support, xxvi. See also resources for information TEMP operator (execution plans), 402 terminating characters changing, 549 CLP option for, 706 test on pureXML, 675-702 testing stored procedures, 555-556 text files as input parameters for CLP, 708 text indexes (DB2 Net Search Extender) altering, 580 creating, 572-579, 591-592 reorganizing, 579-580 updating, 579-580 text nodes, 29-30 concatenation, 30 constructing, 290 index eligibility and, 375-376 text searches.
See DB2 Text Search; full-text searches text( ) node test, 134 time functions in XQuery, 224-226 time zone indicators, 210 TIMESTAMP index data type, 369 tokenize function (XQuery), 217-218 TQ operator (execution plans), 402 transform expression (XQuery), 190, 325 XML attribute values, replacing, 327-328 XML element values, replacing, 326-327 XML node values, replacing with computed values, 329-331 multiple values, 328-329 with parameter markers, 328 transformation functions, 579 transforming XML documents, 352-358 with XQuery, 203 transition variables, 561 transitivity of value comparisons, 156 translate function (XQuery), 218 traversing XML documents, 197
trees of nodes, 28-30 storage of, 30-33 triggers, 523-525, 561-564 delete triggers, 563 executing, 547 IMPORT utility and, 573 insert triggers, 562-563 LOAD utility and, 573 update triggers, 564 troubleshooting empty XPath query results, 134 SQL/XML predicates, 181-182 truncated XML document display, 83 avoiding, 138 type constructors, 212 type errors avoiding in XMLTABLE function, 168-169 in XQuery, 208-212 U UCA (Unicode Collation Algorithm), 252 UDFs (user-defined functions), 547, 556-561 benefits of, 547 executing, 547 extracting repeating XML element values, 557-558 XML element/attribute values, 557 inserting XML documents from files, 79-82 shredding XML data, 558-559 updating XML documents, 559-561 in XQuery, 229-230 unabbreviated syntax in XPath, 157 underscore character (_) in wildcard searches, 583 Unicode explained, 598 locale-aware collations, 252 UTF-8, 27, 597-598 UTF-16, 598 UTF-32, 598 Unicode Byte-Order Mark (BOM), 599 Unicode Collation Algorithm (UCA), 252 union keyword, 154 union of sequences in XPath, 154-155 UNION operator (execution plans), 402 union operator (|) in XPath, 585 UNIONA operator (execution plans), 414 UNIQUE operator (execution plans), 402 unique XML indexes, 364-365 Universal Resource Identifier (URI), 438-439 UNIX. See DB2 for Linux, UNIX, and Windows UNLOAD utility, 67, 112 unloading XML documents, 111-114 unmarshalling, 629 update cursors, modifying XML documents in, 350-351 UPDATE INDEX command (DB2 Net Search Extender), 579-580 UPDATE operator (execution plans), 414 UPDATE statement. See also XQuery Update Facility replacing XML documents, 322-324 validation, 518-519, 542-543 update triggers, 564 UPDATE XMLSCHEMA command, XML Schema evolution with, 495-498 updating.
See also modifying; replacing text indexes automatic updates, 574-576 with DB2 Net Search Extender, 579-580 XML data with XML namespaces, 463-469 XML documents, 433 in DB2 for z/OS, 351-352 with UDFs, 559-561 XML elements/attributes with stored procedures, 554-555 upper-case function (XQuery), 221 upper-case( ) function, 155 “upsert” operations, 342, 560 URI (Universal Resource Identifier), 438-439 UCS-2, 598 user-defined functions. See UDFs user-defined XML indexes, 664-666 UTF-8, 27, 597-598 UTF-16, 598 UTF-32, 598 utilities monitoring performance of, 427-428 XML support in DB2 for z/OS, 65-67 CHECK DATA utility, 69-70 REORG utility, 68-69 REPORT TABLESPACESET utility, 67-68 V -v CLP option, 708 valid XML documents determining XML Schemas for, 538-540
well-formed documents versus, 473 validation, 8, 473 application-centric versus database-centric, 545 checking XML documents for, 534-535 in DB2 for z/OS, 540-544 DB2 for Linux, UNIX, and Windows versus, 543-544 for existing XML documents, 543 with INSERT statement, 541-542 with UPDATE statement, 542-543 dropped XML Schemas and, 493 during loading or importing, 116 during shredding process, 312 enforcing with check constraints, 520-523 with triggers, 523-525 error handling, 525-529 of existing XML documents, 535-538 on INSERT, 514-517 with LOAD and IMPORT utilities, 530-534 against default XML Schemas, 532 against multiple XML Schemas, 530-532 against single XML Schema, 530-531 overriding XML Schema references, 532-534 schema location hints, 534 performance considerations, 434 removing from XML documents, 540 retaining invalid XML documents, 519-520 space consumption and, 51 on UPDATE, 518-519 when to use, 474 whitespace preservation and, 517 XML Schema evolution with/without, 494-495 value comparison operators in XPath, 156-157 value predicates, 147, 153 values attribute values, attribute nodes versus, 136 of repeating XML elements, extracting, 557-558 updating in XML documents with namespaces, 464-465 of XML attributes extracting, 557 replacing, 327-328 of XML elements extracting, 557 replacing, 326-327 of XML nodes, replacing with computed values, 329-331 multiple values, 328-329 with parameter markers, 328 values (XML) converting to binary SQL types, 187-188 ordering result sets by, 186-187 retrieving in relational format, 165-176 tags versus, 19-21 values (XQuery Data Model), 128 VARCHAR HASHED index data type, 368-369, 433 VARCHAR(n) index data type, 367-368 variables host variables, 613-614 in stored procedures, 548 viewing XML document structure, 703-705 views.
See catalog views; relational views Visual Explain tool, 396 execution plans, obtaining, 400-401, 411-413 Visual Studio, IBM Database Add-ins for Visual Studio, 656 volatility of schema, 12 W WebSphere Replication Server, 119 well-formed XML documents, 4, 76 valid documents versus, 473 where clause (FLWOR expressions), 194 whitespace in XML documents, 89-94 changing default preservation option, 93-94 data storage for compliance, 94 preserving, 91-93 types of, 90 preserving during import, 108 validation and, 517 retaining in CLP, 527 stripping, 78 wildcard searches, 583 wildcards in full-text searches, 594 index eligibility and, 376-377
for namespace queries, 449-450 in XPath, 140-141 Windows. See DB2 for Linux, UNIX, and Windows work directories, locating with index directories, 574 X XANDOR operator (execution plans), 402-403 XDBDECOMPXML stored procedure, 313-314 XDB_DECOMP_XML_FROM_QUERY stored procedure, 315-317 XDS (XML Data Specifiers), 99 XISCAN operator (execution plans), 403 XIXAND operator (execution plans), 414 XIXOR operator (execution plans), 414 XIXSCAN operator (execution plans), 414 XML (eXtensible Markup Language), 1 application development. See application development applications, best practices, 434-435 attributes. See attributes (XML) CLP options list of, 706 usage examples, 706-707 documents. See documents (XML) for data exchange, 1 for data storage, 2 pureXML versus alternative storage methods, 10-11 indexes. See indexes (XML) monitoring performance, 424 of database utilities, 427-428 with snapshot monitor, 424-427 namespaces. See namespaces (XML) performance. See performance pureXML. See pureXML relational data versus, 4-7 when to use XML data, 11-13 reorganizing table data, 53-54, 68-69 schemas, best practices, 434 as self-describing data format, 19 as standard, xxiii, xxv, 1 tags. See tags (XML) values converting to binary SQL types, 187-188 ordering result sets by, 186-187 retrieving in relational format, 165-176 tags versus, 19-21 XML 1.0 standard, 2 XML 1.1 standard, 2 XML aggregation. See aggregation XML column references. See column references (XML) XML columns. See columns (XML) XML compression.
See compression XML construction with attribute expressions, 206 with computed values, 202-204 with conditional expressions, 205 direct XML construction, 202 with multiple nesting levels, 206-207 with predicates, 204-205 with XML aggregation, 207-208 with XML namespaces, 460-463 XML data converting relational data to, 267 inserting in XML columns, 294-295 with SQL/XML publishing functions, 268-290 XML declarations for, 292-294 with XQuery constructors, 290-292 generating rows/columns from, 165-166 querying. See querying XML data statistics collection in DB2 for Linux, UNIX, and Windows, 418-419 in DB2 for z/OS, 417-418 with db2cat utility, 419-423 truncation, avoiding, 138 XML data binding to Java objects, 629 pureQuery and, 631 XML Data Specifiers (XDS), 99 XML data type, 7-9, 160 XML declarations. See declarations (XML) XML document trees, 28-30 storage of, 30-33 XML encoding. See encoding (XML); Unicode
XML joins, relational joins versus, 7, 241 XML manipulation in .NET applications, 633-635 in stored procedures, 548-556 hybrid XML data storage, 550-553 loops and cursors, 553-554 testing, 555-556 updating XML elements/attributes, 554-555 with triggers, 561-564 delete triggers, 563 insert triggers, 562-563 update triggers, 564 in UDFs, 556-561 extracting repeating XML element values, 557-558 extracting XML element/attribute values, 557 shredding XML data, 558-559 updating XML documents, 559-561 XML predicates. See predicates XML publishing functions. See publishing functions (SQL/XML) XML sample database creating, 709-710 customer table contents, 710-712 product table contents, 712-713 purchaseorder table contents, 713-714 XML Schema, xxiii, 2, 471 annotated schema shredding, 306-318 advantages/disadvantages of, 301 annotating XML Schema, 306-310 defining annotations in Data Studio Developer, 311 registering annotated schemas, 311-312 shredding multiple XML documents, 315-318 shredding single XML documents, 312-315 custom versus industry standard, 474-476 DB2 for z/OS versus DB2 for Linux, UNIX, and Windows, 510-511 determining for validated XML documents, 538-540 DTDs versus, 501 editing in Data Studio Developer, 653 exporting information with db2look utility, 122 XML documents containing, 105-106 flexibility of, 5-6 granting/revoking usage privileges, 499-500 identifiers, 516 with multiple schema documents, 479-482 in .NET applications, handling, 636 as optional in DB2, 8 parts of, 476-478 reasons for using, 472-473 referencing, 484 registering, 483-491 in CLP (command-line processor), 484-486 error handling for, 490-491 identifiers, 483 with JDBC, 488 with shared schema documents, 489-490 steps in, 483 with stored procedures, 486-487 removing from XSR, 492-493 target namespaces, 438 valid versus well-formed XML documents, 473 validation.
See validation when to validate, 474 XML Schema evolution, 493-498 with document validation, 494-495 with UPDATE XMLSCHEMA command, 495-498 without document validation, 494 XML Schema Repository (XSR), 483, 502-503, 667, 672. See also registering XML Schemas catalog tables/views, 503-508 queries against, 508-510 registering annotated schemas, 311-312 XML storage, 429-430 for compliance, 94 in DB2 for Linux, UNIX, and Windows, 33-41 dropping XML columns, 40 in DB2 9.7 release, 40-41 storage objects, types of, 33-35 table space page size, 36-39 in DB2 for z/OS, 60-73 CHECK DATA utility, 69-70 limiting memory consumption, 71
multiple XML columns, 64 naming conventions, 64-65 offloading XML parsing, 72-73 REORG utility, 68-69 REPORT TABLESPACESET utility, 67-68 storage objects, types of, 61-62 table space characteristics, 63 utilities for, 65-67 inlining, 41-48 benefits of, 47-48 drawbacks of, 48 monitoring and configuring, 43-47 MDC (multidimensional clustering), 58-59 partitioned databases, 59-60 range partitioning, 57-58 space consumption of, 51-53 space management example, 54-57 XML System Services (XMLSS), 72 XML to HTML transformation, 356-358 XML-related catalog tables, 667-673 for XML indexes, 671-672 XML Schema Repository (XSR), 672 for XML storage objects, 667-670 XML-related catalog views, 661-667 SYSCAT.COLUMNS, 661-662 SYSCAT.INDEXES, 663-664 SYSCAT.INDEXXMLPATTERNS, 664-666 SYSIBM.SYSXMLPATHS, 663 SYSIBM.SYSXMLSTRINGS, 662-663 XML Schema Repository (XSR), 667 XML-to-relational joins, 239, 248-250 XML-to-XML joins, 239 outer joins, 250-252 in SQL/XML, 242-247 in XQuery, 240-242 XML2CLOB function (SQL/XML), 290 xml:space attribute, 91-92 XMLAGG function (SQL/XML), 160, 207, 277-283 XMLCONCAT, XMLFOREST compared, 284 XMLATTRIBUTES function (SQL/XML), 160, 275-277 XMLCAST function, 119, 160, 163, 186-187 code page conversion example, 604 XMLCOMMENT function (SQL/XML), 290 XMLCONCAT function (SQL/XML), 270 XMLAGG, XMLFOREST compared, 284 XmlDocument class (.NET), 634 XMLDOCUMENT function, 117, 119, 294-295 XMLELEMENT function (SQL/XML), 160, 268-273, 460-462 XMLEXISTS predicate, 160, 177-182, 188, 431 XMLFOREST function (SQL/XML), 272-273 XMLAGG, XMLCONCAT compared, 284 XMLGROUP function (SQL/XML), 286-289 XMLNAMESPACES function, 453, 460-462 XMLPARSE function, 92-93, 119, 160, 354 XMLPATTERN function in index definitions, 363 XMLPI function (SQL/XML), 290 XMLQUERY function, 119, 160-165, 188, 430 filtering conditions, 587 index eligibility and, 385 returning element values without XML tags, 163-164 repeating elements, 164-165 XML column references in, 162-163
XMLReader class (.NET), 634 XMLROW function (SQL/XML), 286-289 XMLSERIALIZE function, 83, 86, 119, 160, 293, 435, 640 XMLSpy, 657 XMLSS (XML System Services), 72 XMLTABLE function, 160, 165-176, 188 advantages/disadvantages of, 300 aggregation and grouping queries, 234-236 code page conversion example, 604 generating rows/columns from XML data, 165-166 namespace declarations, 452-453 numbering rows based on repeating elements, 173-174 optional elements, handling, 167-168
pureQuery and, 631 returning multiple repeating elements, 174-176 repeating elements, 169-173 shredding XML documents with, 301-306 splitting XML documents, 116-118 type errors, avoiding, 168-169 XMLTEXT function (SQL/XML), 290 XMLVALIDATE function, 119, 160, 514-519, 535-536 XMLXSROBJECTID function, 492, 535, 538-539 XPath, xxiii, 8, 126. See also XQuery axes, 157 comparing with FLWOR expressions and SQL/XML, 196-202 comparison operators, 156-157 construction of sequences, 154-155 data( ) function, 134-135 dot notation, 151-153 double slash (//), 141-142 dynamic expressions, 185-186 embedding in SQL statements, 127 empty results, reasons for, 134 executing in DB2, 137-140 existential semantics, 147-148 file system navigation analogy, 133 full-text searches in, 582 functions, 155 logical expressions, 148-151 node tests, 133 positional predicates, 153-154 predicates, 142-146 usage with SQL/XML, 177-182 sample data for examples, 131-132 simple query examples, 133-136 slash (/), 141 SQL/XML versus, 201 string( ) function, 135 text( ) node test, 134 unabbreviated syntax, 157 union of sequences, 154-155 union operator (|), 585 wildcards, 140-141 XPath expressions best practices, 430 full-text searches with, 593 XPath queries, design decisions and, 17-18 XQuery, xxiii, 8, 126.
See also XPath arithmetic expressions, 212-214 attribute expressions in XML construction, 206 “between” predicates, 431 computed value XML construction, 202-204 conditional expressions in XML construction, 205 constructors, 290-292 XML namespaces and, 462-463 contains function, 587 data types, cast expressions, type errors, 208-212 direct XML construction, 202 with embedded SQL, 127 embedding in SQL statements, 127 SQL in, 227-228 FLWOR expressions, 191-196 comparing with XPath and SQL/XML, 196-202 join queries in, 247 full-text searches, 582, 592 functions, 214-226 Boolean functions, 226 date and time functions, 224-226 namespace and node functions, 222-224 numeric and aggregation functions, 218-220 sequence functions, 220-222 string functions, 215-218 grouping queries in, SQL/XML versus, 237-239 join queries, XML-to-XML joins, 240-242 let and return clauses, index eligibility and, 386-387 modifying XML documents in, 346-349 multiple nesting levels in XML construction, 206-207 namespace and node functions, 445 namespace declarations, 448-450 nesting with SQL, 257-258 outer joins, 250-252 overview, 190 predicates in XML construction, 204-205 sample data for examples, 131-132 SQL functions and UDFs in, 229-230 SQL/XML versus, 201
as stand-alone language, 127 in stored procedures, 554 XML aggregation in XML construction, 207-208 XSLT versus, 353 XQuery 1.0 and XPath 2.0 Data Model, 126, 128-131 sequences constructing, 128-130 as input/output, 130-131 xquery keyword, 137 XQuery Update Facility, 9, 324-326 XML attribute values, replacing, 327-328 XML element values, replacing, 326-327 XML elements/attributes, renaming, 334-335 XML node values, replacing with computed values, 329-331 multiple values, 328-329 with parameter markers, 328 XML nodes deleting, 333-334 inserting, 335-340 modifying multiple, 343-346 repeating/missing, 340-343 replacing, 331-332 XSCAN operator (execution plans), 402 XSL (eXtensible Stylesheet Language), 352 XSLT (eXtensible Stylesheet Language Transformation), 352-358 XML to HTML transformation, 356-358 XQuery versus, 353 XSLTRANSFORM function, 353-356 XSLTRANSFORM function, 352-356 XSR (XML Schema Repository), 483, 502-503, 667, 672. See also registering XML Schemas catalog tables/views, 503-508 queries against, 508-510 registering DTDs, 501 removing XML Schemas from, 492-493 XSR Objects, 483 XSR_GET_PARSING_DIAGNOSTICS stored procedure, 525-528 Y–Z z/OS. See DB2 for z/OS zAAP (System z Application Assist Processors), 71-72 zeros, leading zeros in XML element construction, 285-286 zIIP (System z Integrated Information Processors), 71-72