Tuesday, March 10, 2009

Lloyd on Schematron and validation

Lloyd McKenzie wrote:
Hi Adam,

The philosophy the MIF has taken thus far is as follows:
- If you can put a rule in the schema, do that. Schema rules can be enforced by editors on the fly, used for type-ahead, used for code-generation, etc. They're also easier to write and to get right.
- If you can't put a rule in the schema, put it in as an XPath expression in Schematron. At least those can be easily invoked on XML instances.
- If you can't express the rule in XPath (e.g. URLs must resolve to a live link, no recursion allowed in code system hierarchies, etc.), then document it as a comment for software developers to implement in code.
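
As an illustration of the third tier, here is a minimal Python sketch of the "no recursion in code system hierarchies" rule, which has to live in code rather than in XPath. The function name and the shape of the `parents` mapping are assumptions made for this example; they are not part of the MIF or of any HL7 tooling.

```python
from typing import Dict, List, Optional


def find_hierarchy_cycle(parents: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of codes, or None if the hierarchy is acyclic.

    `parents` maps each code to the codes it specializes (hypothetical shape).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {code: WHITE for code in parents}
    path: List[str] = []

    def visit(code: str) -> Optional[List[str]]:
        state[code] = GREY
        path.append(code)
        for parent in parents.get(code, []):
            parent_state = state.get(parent, WHITE)
            if parent_state == GREY:
                # The parent is already on the current path, so this is a loop.
                return path[path.index(parent):] + [parent]
            if parent_state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[code] = BLACK
        return None

    for code in list(parents):
        if state[code] == WHITE:
            found = visit(code)
            if found:
                return found
    return None
```

A publication step could call this once per code system and report the returned path to the author; an empty result (None) means the rule holds.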

We can use schema as well as XSLT/Schematron in the publication validation process, and I can't see any reason why we wouldn't always do both. Even if the schema weren't customized, we'd still want to do schema validation to confirm that it's legal XHTML. Moving all validation into Schematron is crazy. You can enforce element order, nesting and most type definitions far more easily with the schema language than you can with XPath.

I recognize that "as you type" validation is hard. I'm not sure it's fair to say it's impossible. However, it may be more expensive than is reasonable to implement. That said, we want to get as close to "as you type" validation as we can. Anything that's easy to customize in whatever WYSIWYG XHTML editor we use, and that lets us more closely approximate the rules in the MIF, we should do. For the rest, I agree that a "validate" or "verify" button is reasonable. It would also be good if you couldn't actually close/leave the window or apply changes so long as there's an outstanding validation issue. However, I'm not clear on why the validation process couldn't (or shouldn't) use all three validation approaches - schema, XSLT and code.
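
To make the "validate button" idea concrete, here is a rough Python sketch of an editor that runs every registered check and refuses to apply changes while any issue is outstanding. Everything in it is hypothetical: `FragmentEditor`, `well_formed` and `no_empty_paragraphs` are names invented for the example, and the two checks are simple stand-ins for the real schema, Schematron and coded rules, since the Python standard library has no schema or Schematron validator.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

# A validator takes the fragment text and returns a list of issue messages.
Validator = Callable[[str], List[str]]


def well_formed(content: str) -> List[str]:
    """Stand-in for the schema tier: only checks XML well-formedness."""
    try:
        ET.fromstring(content)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    return []


def no_empty_paragraphs(content: str) -> List[str]:
    """Hypothetical coded rule: a <p> must contain text or child elements."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return []  # already reported by well_formed
    empty = [p for p in root.iter("p") if not (p.text or "").strip() and len(p) == 0]
    return [f"{len(empty)} empty paragraph(s)"] if empty else []


@dataclass
class FragmentEditor:
    validators: List[Validator]
    content: str = ""
    saved: str = ""

    def validate(self) -> List[str]:
        """Run every validator and collect all issues (the 'validate' button)."""
        issues: List[str] = []
        for validator in self.validators:
            issues.extend(validator(self.content))
        return issues

    def apply_changes(self) -> bool:
        """Save only when no validation issue is outstanding."""
        if self.validate():
            return False
        self.saved = self.content
        return True
```

The point of the design is that the gate does not care how a rule is encoded: schema, XSLT and code checks all plug in as validators behind the same button.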

Schema validation can be performed in an Ant script. Changing the schema isn't an issue. MIF schemas change regularly, though the markup rules are unlikely to change much, if at all (whether they sit in the schema, in Schematron or in programmatic rules). I recognize that when schemas change, this may mean changing software. However, when anything in the MIF changes, it may mean changing software. How the rules happen to be encoded shouldn't impact whether software change is necessary or desirable. I'll admit that W3C schema validation messages aren't as human-readable as Schematron messages, but that's a small price to pay for direct support in XML validation tools, code generation, type-ahead, etc. And schema validation can be (and currently is) applied in batch just like XSLT validation.

I'm not arguing against external validation. We currently have, and will continue to use, external validation during the publication process for various reasons including those you mentioned (external references that can't be known at design time, custom tweaking by users outside of the UI, tooling bugs, changing rules, etc.). However, we also want as much validation as humanly possible to be done at design time. Every error that's caught within the authoring process is an error we don't have to deal with in the 5 days between final content submission and ballot opening.


On some of your other points:
- We specify maximum lengths on everything so that from a user-interface, data storage and software design perspective, we know what may be encountered and can write code accordingly.
- Id references are for things like referring to tables, figures, graphics, etc. from elsewhere in the fragment. And our source materials don't currently use GUIDs. Seeing as some of the source materials are hand-authored, they shouldn't be required to use GUIDs. And because multiple XHTML fragments from multiple sources may be combined in a publication instance, we can only ensure uniqueness within the fragment, not a whole document.
- In the cut and paste scenario, the "accept what they pasted and report it in the error report" is definitely the least desirable of the options. We don't want users submitting invalid content. We don't have the bandwidth to fix content on their behalf. If they decide to use a different authoring tool and copy and paste, then they have the responsibility for getting that content clean. It is not the responsibility of HL7 HQ or the publishing process. In the event that bad content does sneak through, it's just going to get kicked back to the author anyhow. May as well get it right the first time.
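
As a rough sketch of the first two points, the Python below checks a single fragment for over-long ids, duplicate ids, and internal references that don't resolve. It only looks inside the fragment it is given, matching the fragment-level uniqueness described above. The function name, the 40-character limit, the use of `<a href="#...">` as the reference form, and the absence of an XML namespace are all assumptions made to keep the example short; the real limits and reference mechanism are whatever the MIF markup schema defines.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List


def check_fragment_ids(fragment_xml: str, max_id_length: int = 40) -> List[str]:
    """Return issue messages for one fragment; an empty list means it is clean.

    max_id_length is a hypothetical limit, not a value taken from the MIF.
    """
    root = ET.fromstring(fragment_xml)
    issues: List[str] = []

    # Collect every id attribute in document order.
    ids = [el.get("id") for el in root.iter() if el.get("id") is not None]

    for value, count in Counter(ids).items():
        if count > 1:
            issues.append(f"duplicate id '{value}' ({count} occurrences)")
        if len(value) > max_id_length:
            issues.append(f"id '{value}' exceeds {max_id_length} characters")

    # Internal references must point at an id inside this same fragment.
    known = set(ids)
    for link in root.iter("a"):
        href = link.get("href", "")
        if href.startswith("#") and href[1:] not in known:
            issues.append(f"reference '{href}' does not resolve within the fragment")

    return issues
```

For example, a fragment with two elements sharing `id="t1"` and a link to `#fig2` yields one duplicate-id message and one unresolved-reference message. Checks like this can run both in the editor and again in the publication process.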


Lloyd
March 10th 2009
