
Artificial Intelligence Journal



Trends in High Volume XML Publishing


Integrating efficient XML publishing into high-volume content environments remains a significant challenge. Among the many real-world barriers are the need to convert large quantities of paper and other legacy documents, the need to integrate easy-to-use XML publishing tools into the content-creation process, and the lack of workflow management tools suited to mass conversion environments.

In many environments content creators resist using XML authoring tools, preferring traditional word processing or desktop publishing applications, and simplified "template-style" DTDs are used to accommodate productivity requirements. Consequently, high-volume XML conversions are typically accomplished through "brute force" solutions, where mass OCR (optical character recognition) scanning and tagging are done through expensive outsourcing, often to developing countries where labor for repetitive high-volume publishing tasks is plentiful and inexpensive.

To realize the full benefits of XML for both highly structured and mixed-structure content in high volumes - without the cost, cycle time requirements, and other outcomes inherent in outsourcing - an XML publishing system that minimizes the ongoing intervention of XML programmers is essential. To efficiently convert documents in Word, HTML, PDF, RTF, or other formats into XML, intelligent, rule-based automated markup solutions are required.

For large-volume projects, efficient XML publishing also requires batch processing and workflow management solutions that optimize productivity. In this article I'll discuss the process requirements for high-volume XML creation and introduce new tools and technologies expressly developed for these mass-conversion environments.

The High-Volume XML Publishing Challenge
Any document can be represented in XML, but document types vary widely, creating diverse challenges for high-volume publishing requirements. Document data ranges from highly consistent and repetitive structures to extremely diverse content structures that defy accurate digital representation. For example, accident reports, product catalogs, employment applications, financial forms, and other such documents are very amenable to automated XML conversion. On the other hand, dissertations, marketing reports, résumés, news articles, and similar documents feature abstract intellectual information with highly diverse components and inconsistent composition.

For highly structured content, identifying and tagging variables can easily be automated through forms, scripts, and other techniques, but mixed-structure data requires an XML authoring tool or a postauthoring conversion process. The assignment of tags to elements in a document is fundamentally a separate and distinct exercise from the authoring process. Authors may intuitively recognize elements - such as a "phone number," "chapter heading," "ingredient," or "customer type" - but identifying one as an element requires a start tag and an end tag of the appropriate element type, delimited by angle brackets (e.g., <phonenumber>). This process can be simplified and accelerated, but it can't be eliminated.
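The separation between recognizing an element and marking it up can be made concrete with a small sketch. The element name phonenumber comes from the example above; the regular expression and number format are illustrative assumptions, not part of any particular DTD:

```python
import re

# A recognized element is marked up by wrapping the matched text
# in a start tag and an end tag delimited by angle brackets.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def tag_phone_numbers(text: str) -> str:
    """Wrap every phone-number match in <phonenumber>...</phonenumber>."""
    return PHONE.sub(lambda m: f"<phonenumber>{m.group(0)}</phonenumber>", text)

print(tag_phone_numbers("Call 555-123-4567 for support."))
# Call <phonenumber>555-123-4567</phonenumber> for support.
```

The author's prose stays untouched; the tagging is a separate pass over the finished text, which is exactly why it can be deferred and automated.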

Requiring content creators - knowledge workers such as technical writers, paralegals, insurance adjusters, law enforcement personnel, research professionals, and lab technicians - to perform manual tagging on original content is an unrealistic expectation in many environments. While considerable advancements have been made in XML authoring tools, they remain unappealing to many users who prefer standard word processing or desktop publishing software. Drop-down menus for tag selection, template support, advanced scripting, macros, and other modern enhancements to XML authoring systems still require manual tagging, an activity that is wholly separate from the document creation process.

In addition to the variability of document type, the thoroughness of document representation also may vary widely, depending on the application. It's possible to categorize a document simply by title, date, subject, and author, or render a document with hundreds of variables. The wide variability in content types, content sources, and data applications requires the customization of virtually every high-volume XML publishing system in order to achieve specific business goals.

High-volume XML publishing also frequently involves legacy documents in paper or electronic form. It's not uncommon for an XML publishing project to involve warehouses of paper in bankers' boxes, thousands of pounds of microfilm and microfiche, thousands of tapes or disks in obsolete, proprietary formats, and/or literally terabytes of PDF files. In many industries the events that initiate an XML project - such as mergers/acquisitions, new document management procedures, government regulations, and/or new business initiatives - are also the events that involve the greatest volume of archival information requiring XML conversion.

Key Requirements for a Mass Conversion Platform
For the full benefits of XML to be realized for high-volume conversion of unstructured content, a production platform will increasingly use intelligent automated solutions (autotagging) to achieve productivity and quality-control objectives. In some applications these automated tagging solutions can coexist with scripts and templates, but the ultimate solution should be driven by content goals. In many environments content authors will resist using XML tools or templates; moreover, content is often generated outside the organization, where conformance to standard authoring procedures is impossible.

Another requirement for high-volume XML publishing is that the DTD/Schema structure should be determined by the user or application, not by the XML editing tool, production platform, or skills of the operators. In addition to high cost and long cycle time, outsourcing XML conversion also frequently compels the use of standardized or simplified DTDs for productivity reasons, rather than the richly structured DTDs demanded by the application. Sacrificing powerful DTDs to reach cost or productivity goals is a short-term strategy that may negatively impact the overriding knowledge-management goals of the organization.

XML publishing platforms also require an integrated system that addresses all steps in the conversion process, from input through tagging, proofing, validation, and quality control. To accomplish this in an efficient manner, high-volume XML publishing requires batch processing and workflow management solutions that optimize productivity.

Automated Markup
Newly emerging technologies replace manually intensive markup processes with automated tagging to enable efficient in-house conversion projects as well as to support outsourcing solutions. Autotagging technology assigns data to the appropriate DTD tags based on comprehensive rules or identified strings set up by content administrators. The amount of markup that can be accomplished with these new technologies depends on the type of document converted and the resolution of the DTD structure assigned. With many types of content, over 95% of the markup can be accomplished with autotagging, enabling mass conversion projects to be completed efficiently without outsourcing to a service bureau.

These new automated markup tools allow content administrators to develop comprehensive rules that define, identify, and assign element tags based on user-specified DTD/Schemas. These rules can be extremely sophisticated patterns set up with strings based on key words, phrases, document location, data type (numeric, alphanumeric), or other identifiable pattern or identifier.
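A sketch of what such rules might look like in code follows. The rule fields, element names, and patterns here are illustrative assumptions, not a vendor's actual rule format; they show rules keyed on data type (a numeric ZIP code) and on a key word ("Total:"):

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    """One autotagging rule: a pattern plus the element tag it assigns."""
    element: str          # target element in the user-specified DTD/Schema
    pattern: re.Pattern   # key word, data type, or other identifiable pattern

RULES = [
    # Data-type rule: a five-digit (optionally ZIP+4) U.S. postal code.
    Rule("zipcode", re.compile(r"\b\d{5}(?:-\d{4})?\b")),
    # Key-word rule: a dollar amount immediately following "Total: ".
    Rule("total", re.compile(r"(?<=Total: )\$[\d,]+(?:\.\d{2})?")),
]

def autotag(text: str) -> str:
    """Apply each rule in order, wrapping every match in its element tags."""
    for rule in RULES:
        text = rule.pattern.sub(
            lambda m, e=rule.element: f"<{e}>{m.group(0)}</{e}>", text)
    return text

print(autotag("Shipped to 94303. Total: $1,250.00"))
# Shipped to <zipcode>94303</zipcode>. Total: <total>$1,250.00</total>
```

In a production system the rule set would be built through the drop-down menus and sample-document analysis described below rather than written by hand, but the underlying mechanism - pattern in, element tag out - is the same.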

Representative sample documents that are fully marked up are used to identify the patterns, signifiers, and rules that indicate element tags. Drop-down menus list the possible rule components, such as key words, digits, spacing, and formats that may be grouped to form a rule.

Automated markup can also be extremely effective for converting simple DTD/Schemas into complex DTD/Schemas and in replacing costly, cumbersome, scripted approaches to XML conversions. For highly variable content with a thorough DTD, 60-70% of the markup can be accomplished - enough to make an enormous impact on the cycle time and cost of XML publishing.

Workflow Design and Production Control
High-volume XML publishing requires workflow management solutions to achieve appropriate productivity and production control goals. For enterprise-scale processing of XML data, content administrators want to assign and distribute separate production tasks to various operators to maximize output, balance workloads, and ensure the highest quality. In addition, the workflow design must provide effective management oversight, enabling appropriate monitoring, notification, approvals, and audit trails. In most applications the production system will also require policies and procedures to ensure appropriate information security and workstation statistics for production control. To accomplish these objectives, workflow management tools need to be highly integrated into the mass conversion process from authoring and OCR scanning, through a multistep markup process and on to proofing and validation.

To achieve this high level of integration between workflow management and production tasks, XML publishing platforms require a project dispatching system from initial input (electronic document or scan), autotagging, manual tagging, proofing, validation, and posting. The system should manage this process by automatically dispatching work-in-process files to workstations based on a predetermined design established by administrators. The workflow dispatching system is integrated into the scanning, autotagging, manual tagging, and proofing/validation tools throughout, and work-in-process status can be determined at any time by authorized content managers.
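A minimal sketch of such a dispatching system follows, assuming a fixed, administrator-defined stage sequence; the stage names, class shape, and queue structure are illustrative assumptions:

```python
from collections import deque

# Work-in-process files move through a predetermined stage sequence
# established by the content administrators.
STAGES = ["scan", "autotag", "manual_tag", "proof", "validate", "post"]

class Dispatcher:
    """Routes each work-in-process file to the queue for its next stage."""

    def __init__(self):
        self.queues = {stage: deque() for stage in STAGES}
        self.status = {}  # doc id -> current stage, for manager oversight

    def submit(self, doc_id: str):
        """Enter a new document at the first stage."""
        self.queues[STAGES[0]].append(doc_id)
        self.status[doc_id] = STAGES[0]

    def complete(self, doc_id: str):
        """Called when a workstation finishes its stage; advance the file."""
        current = self.status[doc_id]
        self.queues[current].remove(doc_id)
        i = STAGES.index(current)
        if i + 1 < len(STAGES):
            self.queues[STAGES[i + 1]].append(doc_id)
            self.status[doc_id] = STAGES[i + 1]
        else:
            self.status[doc_id] = "done"

d = Dispatcher()
d.submit("report-001")
d.complete("report-001")       # scanning finished
print(d.status["report-001"])  # autotag
```

The status dictionary is what gives authorized content managers the at-any-time view of work-in-process that the article describes; a real platform would persist it and attach notification, approval, and audit-trail hooks.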

Legacy document conversion often must be accommodated in mass conversion operations. Capturing accurate data from scanned documents is conducted as a preprocess module prior to XML conversion and may require varying degrees of proofing/quality control. Conversion of legacy digital file formats - such as MS Office files, PageMaker files, RTF files, or PDF files - into XML requires less proofing and quality control prior to XML markup and can also be seamlessly integrated into the mass conversion process.

Workflow design becomes an important factor in optimizing XML conversion. The number of workstations for any particular production process will be determined by the specific nature of the project, such as the expected accuracy of the OCR process or the mix of document types (i.e., scanned or electronic documents). Workflow designs using automated markup processes will be significantly different from those dependent on manual markups. In manual markup processes workflow design is typically based on DTD complexity, where specialized tagging tasks are distributed to operators specifically trained in a subset of the content. Automated markup processes using autotagging are designed to eliminate these specialized tagging processes, enabling highly sophisticated markups to be accomplished through only two stages: autotagging and manual tagging for exception handling. In this way autotagging allows the workflow design to be simplified and fixed across a variety of conversion projects; the number of workstations will be dictated primarily by volume.
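The two-stage flow - autotag everything, then route only exceptions to manual tagging - can be sketched as follows. The casenumber element, the pattern, and the exception criterion (a required element left untagged) are illustrative assumptions:

```python
import re

REQUIRED = {"casenumber"}  # elements the DTD requires in every document

def autotag_stage(text: str) -> str:
    """Stage 1: rule-based tagging of case numbers following 'Case '."""
    return re.sub(r"(?<=\bCase )\d{4}-\d{4}\b",
                  r"<casenumber>\g<0></casenumber>", text)

def needs_manual_tagging(text: str) -> list:
    """Stage 2 gate: list required elements the rules failed to tag.

    An empty list means the document skips the manual station entirely.
    """
    return [e for e in sorted(REQUIRED) if f"<{e}>" not in text]

doc = autotag_stage("Case 2002-0147 was closed in March.")
print(needs_manual_tagging(doc))  # [] -> no exceptions, no manual stage
```

Because only the documents that fail the gate reach an operator, the manual workstations scale with the exception rate rather than with total volume, which is the simplification the article attributes to autotagging.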

The combination of autotagging solutions with effective manual markups for exception handling and quality control allows content administrators to specify the depth and detail of the DTD/Schema based on the application without a loss in productivity or cycle time. To allow for breaking up the tagging process for complex documents in a production setup, advanced DTD editors provide multiple "views" of a richly structured DTD tree, further simplifying and accelerating the conversion of specialized documents.

Designing a High-Volume Conversion Process
The real-world variability in content types, content sources, volume, and in-house resources requires customized process solutions to realize efficient high-volume conversion. Only through a rigorous analysis phase and thorough process design can an XML conversion system be optimized for any particular environment. The use of consultants without commercial ties to specific vendors is often extremely useful in fully evaluating options and understanding the critical links between technology tools and human-factor organizational issues.

The process for mass conversion typically begins with a planning phase that addresses variables such as source material evaluation, target evaluation (number of targets and DTDs), volume estimates, and time frame. These issues will define the project scope, budget, and quality levels. The design phase will determine process flows, organizational requirements, infrastructure needs, workload balancing tools, and other implementation requirements.

Because the accuracy of OCR engines, organizational issues, and XML conversion quality must be thoroughly validated before full production, a proof of concept is generally a part of any comprehensive implementation. Output quality of every document source must be scrutinized in detail to eliminate errors and anomalies. In addition, productivity goals, cost estimates, and other objectives require validation.

Even with a rigorous testing phase, ongoing monitoring of quality will be required due to the variability of labor-intensive proofing and tagging steps. Conversion schedules, material trafficking, exception reporting, and delivery mechanisms need to be optimized after system rollout to achieve full benefits and efficiencies.

As XML publishing proliferates in corporations, governments, educational institutions, service bureaus, and other organizations, seamlessly integrating the markup process into content generation and authoring will be a primary objective for tool and platform developers. For cycle time and information security reasons, organizations will increasingly look toward cost-effective methods to accomplish the markup process through in-house means.

Trends in both workflow management and automated markups anticipate an expanded role for XML content administrators. Rather than specific expertise in XML syntax and semantics, content administrators will require new skill sets in organizational leadership, process design, and process management. The widespread acceptance of XML - along with advances in tools and publishing platforms - will usher in a new, expanded role for the content administrator in today's dynamic organizations.

Pattern Recognition for Automated XML Tagging
The inherent complexities and inefficiencies of manual markup tagging for mixed-structure content are being addressed through tools that use rules driven by identified strings and patterns to automate the tagging process. These tools create unique pattern files that are associated with specific elements. Elements generally have multiple rules or patterns that can be used to identify accurate tags.

Rather than writing Perl scripts, content administrators use drag-and-drop as the fastest and most accurate way to associate structure points and data with a pattern. Common expressions that may indicate relevant text and structure associated with a rule can be quickly identified using color coding, multiple windows, and other user interface techniques.

Testing, refining, and optimizing the rules over sample documents are integral parts of the process. Processing sample documents identifies additional rules, exceptions, anomalies, and other factors that can be accounted for in the full production run. Rule libraries can be developed to allow content administrators to quickly build autotagging processes for common elements over a wider variety of document types. In this way the automated-tagging process can be optimized over time, eliminating a vast majority of the manual markup requirements for XML publishing.

Practical Examples
Using the conventions for pattern components and character mnemonics in Tables 1 and 2, a sample pattern file can be created. Table 1 summarizes the major pattern components required for effective automated markup; these components are used to initiate and execute a pattern search, and drop-down menus facilitate rule building. Table 2 includes some of the character mnemonics and their relationship to Perl scripts.

A rule for identifying an element begins with an examination of what patterns constitute the data. Consider the date January 1, 2002, as an example.

The process for creating a rule based on this character set requires defining the pattern for each component (month, day, comma, year) and associating the block to the DTD. Each pattern is created through drag-and-drop by simply highlighting the component with a mouse click. The completed pattern file, composed of multiple patterns, would then look like Listing 1.
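Since Listing 1 itself is not reproduced here, a comparable rule can be sketched in Python with a regular expression. The element names date, month, day, and year are illustrative assumptions about the target DTD; each pattern component (month, day, comma, year) appears as one group in the block:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# One pattern per component (month, day, comma, year), grouped as a block.
DATE = re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b")

def tag_date(text: str) -> str:
    """Associate the matched block's components to date child elements."""
    return DATE.sub(
        r"<date><month>\1</month><day>\2</day><year>\3</year></date>", text)

print(tag_date("Filed on January 1, 2002 by the adjuster."))
```

Running this on a sentence containing the example date wraps it as a date element with month, day, and year children, which is the kind of component-by-component association the drag-and-drop rule builder is described as producing.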

The pattern in Listing 1 would then produce the correspondingly tagged date components in the XML output.


Common rules, such as the one above, can be stored in rules libraries and reused for multiple document types. Many observers believe that only through such automated markup techniques can XML publishing be efficiently incorporated in high-volume content applications consisting of mixed-structure content. Further developments in the use of intelligent rule-based algorithms to automate the markup process will emerge for applications in the legal, financial, technical, medical, regulatory, and other fields involving mixed-structure content. These developments will involve both Boolean operations and other forms of artificial intelligence, such as neural networks, to produce XML markups.

More Stories By Evan Huang

Evan Huang is cofounder and chief technology officer of XMLCities, a developer of XML content creation, conversion, and publishing tools. A frequent contributor to various technical journals, Evan previously held positions at SRI and Adobe Systems and taught at Notre Dame and Northwestern Polytechnic. He holds a PhD in electrical engineering from Notre Dame.
