Candle Pattern Reference

Version	: Candle 0.10
Published date	: Nov 12, 2011

1. Introduction

A distinctive feature of Candle is its unified pattern language, which can match three major kinds of data model. The three types of pattern in Candle are:

Sequence Pattern: a pattern on a sequence of items, similar to an XQuery Sequence Type.
Node Pattern: a pattern on a node hierarchy, similar to a RELAX NG pattern.
String Pattern: a pattern on a string, similar to regular expressions (RegEx) and BNF.

All three types of pattern share the same context-free grammar. The grammar has only a few easy-to-understand constructs: rule reference, choice, concatenation, exclusion and repetition. Combined, they can define very complex patterns.

The three types of pattern differ only in the terminals of the grammar: the terminals of sequence patterns are items, the terminals of node patterns are nodes, and the terminals of string patterns are characters.
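To make the idea of one grammar over different terminals concrete, here is a minimal Python sketch. It is not Candle's implementation: the names `term`, `choice`, `concat`, `exclude`, `repeat` and `matches` are hypothetical helpers invented for this illustration. Each construct is written once, and only the terminal predicate changes between a string pattern and a sequence pattern.

```python
from typing import Callable, Optional, Sequence, Set

# A matcher maps (source, start) to the set of positions where a match may end.
Matcher = Callable[[Sequence, int], Set[int]]

def term(pred: Callable[[object], bool]) -> Matcher:
    """Terminal: consume one item (character, node or sequence item) accepted by pred."""
    return lambda src, pos: {pos + 1} if pos < len(src) and pred(src[pos]) else set()

def choice(a: Matcher, b: Matcher) -> Matcher:
    return lambda src, pos: a(src, pos) | b(src, pos)

def concat(a: Matcher, b: Matcher) -> Matcher:
    return lambda src, pos: {end for mid in a(src, pos) for end in b(src, mid)}

def exclude(a: Matcher, b: Matcher) -> Matcher:
    # Keep the spans matched by a that b does not also match.
    return lambda src, pos: a(src, pos) - b(src, pos)

def repeat(a: Matcher, lo: int = 0, hi: Optional[int] = None) -> Matcher:
    """'*' is repeat(a), '?' is repeat(a, 0, 1), '+' is repeat(a, 1)."""
    def run(src, pos):
        ends, frontier, count = set(), {pos}, 0
        while frontier:
            if count >= lo:
                ends |= frontier
            if hi is not None and count == hi:
                break
            # Require progress on each repetition so the loop always terminates.
            frontier = {e for p in frontier for e in a(src, p) if e > p}
            count += 1
        return ends
    return run

def matches(pattern: Matcher, src: Sequence) -> bool:
    """True when the pattern can consume the whole source."""
    return len(src) in pattern(src, 0)

# The same shape, "one X followed by zero or more Y", over two kinds of terminal.
name = concat(term(str.isalpha), repeat(term(str.isdigit)))      # terminals are characters
record = concat(term(lambda x: isinstance(x, str)),
                repeat(term(lambda x: isinstance(x, int))))       # terminals are items

print(matches(name, "a12"), matches(record, ["id", 1, 2]))  # True True
```

Returning a set of end positions rather than a single position is a design choice of this sketch: it lets choice and repetition try every alternative without explicit backtracking.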

2. String Pattern and Grammar

Candle's grammar is self-hosted: it is itself written in Candle's grammar notation, which is closely based on EBNF.

There are two types of text grammar rules: lexical rules and syntax rules. Each rule starts with the qualified name of the rule, followed by a '=' or ':=' token, then the production of the rule, and is terminated by a ';' token.

Lexical rules: grammar rules whose production matches exactly against the source text. The '=' token separates the rule name from the rule production, e.g.:

Name = ("_" | letter), ("_" | "-" | "." | letter | digit)*;

Syntax rules: grammar rules in which whitespace is implied between the terms of the production and need not be specified explicitly by the user. Syntax rules use the ':=' token to separate the rule name from the rule production, e.g.:

Prologs := (NamespaceProlog | ImportProlog | ExternalRoutine | NativeRoutine)*,
    (Grammar | Schema | Structure | Class | Function | Template | Method | GlobalVarDeclaration)*;
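The difference between the two kinds of rule can be sketched with Python regular expressions. This is an illustration only, under two assumptions: the implied whitespace sits between consecutive terms, and `\s*` stands in for the grammar's space rule (comments are ignored). The helper names `lexical` and `syntax` are hypothetical.

```python
import re

SPACE = r"\s*"  # assumed stand-in for the `space` rule, without comments

def lexical(*terms: str) -> "re.Pattern[str]":
    """Lexical rule ('='): terms must match the source text back to back."""
    return re.compile("".join(f"(?:{t})" for t in terms))

def syntax(*terms: str) -> "re.Pattern[str]":
    """Syntax rule (':='): the space rule is implied between consecutive terms."""
    return re.compile(SPACE.join(f"(?:{t})" for t in terms))

# Three terms: the keyword "grammar", a simplified name, and "{".
terms = (r"grammar", r"[A-Za-z_][\w.-]*", r"\{")

print(lexical(*terms).fullmatch("grammar demo {") is not None)  # False
print(syntax(*terms).fullmatch("grammar demo {") is not None)   # True
```

With the same three terms, the lexical form rejects the spaced input because nothing in the production accounts for the blanks, while the syntax form accepts it.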

The grammar of Candle's grammar notation is given below:

<?csp1.0?>

!! the grammar rules of Candle grammar
grammar candle-grammar {
    root = grammar-root;
    space = (s | line-comment | block-comment)*;
    grammar-root := "grammar", qname, "{", (grammar-rule)+, "}";
    grammar-rule := lexical-rule | syntax-rule;
    lexical-rule := qname, "=", string-pattern, ";";
    syntax-rule := qname, ":=", string-pattern, ";";
    string-pattern := (choice-pattern | concatenation-pattern | exclusion-pattern)*;
    choice-pattern := repetition-pattern, "|", repetition-pattern;
    concatenation-pattern := repetition-pattern, ",", repetition-pattern;
    exclusion-pattern := repetition-pattern, "-", repetition-pattern;
    repetition-pattern := pattern-term, ("*" | "?" | "+")?;
    pattern-term := string | rule-reference | pattern-group;
    rule-reference := qname;
    pattern-group := "(", string-pattern, ")";
    !! rule productions, like string, line-comment, block-comment, are omitted
}

If you understand EBNF or RELAX NG, the grammar above should be self-explanatory. A few points are worth noting:

Candle grammar has several reserved rule names: root is the top-level grammar rule where grammar matching starts; space is the grammar rule Candle uses for the implied whitespace when matching syntax rules. The root rule is always required in a grammar. The space rule is required if the grammar contains syntax rules.
The grammar production is a pattern expression. There are 4 types of pattern expression: the choice-pattern, the concatenation-pattern, the exclusion-pattern and the repetition-pattern.
Pattern group: brackets can be used to group grammar terms together. The first 3 types of pattern are intentionally defined to have the same precedence, so that users are forced to add brackets '()' when different patterns are nested within each other. This should make the grammar more readable.
choice-pattern defines a set of alternative terms as the pattern. The source needs to match only one of the terms in the choice pattern.
concatenation-pattern (called a group pattern in RELAX NG) defines a series of terms as the pattern. The source has to match all the terms, in the order given, for the entire pattern to match.
exclusion-pattern matches any source text that matches the repetition-pattern on the left side of the '-' token but does not match the repetition-pattern on the right side, e.g.:

LineComment = "!!", (char - ("&cr;" | "&lf;"))*;

repetition-pattern can be further divided into 3 sub-types. The three repetition operators, which are widely used in RegEx, are:

*: repeat 0 or more times;
?: repeat 0 or 1 time;
+: repeat 1 or more times.
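The LineComment rule shown earlier combines exclusion with the '*' repetition. As a rough analogy only, the same rule can be written as a Python regular expression, with a negative lookahead playing the role of the '-' operator. The name `is_line_comment` is a hypothetical helper, and the analogy holds for this single-character exclusion; it is not a claim about how Candle evaluates exclusion in general.

```python
import re

# LineComment = "!!", (char - ("&cr;" | "&lf;"))*;
# DOTALL makes "." behave like `char` (any character), so the negative
# lookahead alone is what excludes carriage return and line feed.
LINE_COMMENT = re.compile(r"!!(?:(?![\r\n]).)*", re.DOTALL)

def is_line_comment(text: str) -> bool:
    """True when the whole text is one line comment."""
    return LINE_COMMENT.fullmatch(text) is not None

print(is_line_comment("!! a comment"))              # True
print(is_line_comment("!! two\nlines"))             # False
print(LINE_COMMENT.match("!! two\nlines").group())  # !! two
```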

At the leaf level of the grammar are the string terminals. There are 2 types of string terminal in Candle:

Literal String: a literal character or string wrapped in double quotes; XML text escape sequences are used to escape special characters.

Qname: a reference to another rule defined in the grammar, or to a predefined string terminal in Candle. The predefined string terminals in Candle include:

`s`	- one or more space, tab, carriage return and line feed characters
`char`	- any single Unicode character
`letter`	- one 'a-z' | 'A-Z' character
`letters`	- one or more 'a-z' | 'A-Z' characters
`lower`	- one 'a-z' character
`lowers`	- one or more 'a-z' characters
`upper`	- one 'A-Z' character
`uppers`	- one or more 'A-Z' characters
`digit`	- one '0-9' character
`digits`	- one or more '0-9' characters
`hex-digit`	- one '0-9' | 'a-f' | 'A-F' character
`hex-digits`	- one or more '0-9' | 'a-f' | 'A-F' characters
`qname`	- XML QName
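For readers more familiar with regular expressions, the table above can be approximated by the following Python mapping. The character classes are taken directly from the descriptions in the table; the dictionary `PREDEFINED` and the helper `is_terminal` are hypothetical names for this sketch, and `qname` is deliberately left out because its full definition lives in the XML Namespaces specification.

```python
import re

# Regex approximations of Candle's predefined string terminals.
PREDEFINED = {
    "s": r"[ \t\r\n]+",           # one or more whitespace characters
    "char": r"[\s\S]",            # any single character
    "letter": r"[A-Za-z]",
    "letters": r"[A-Za-z]+",
    "lower": r"[a-z]",
    "lowers": r"[a-z]+",
    "upper": r"[A-Z]",
    "uppers": r"[A-Z]+",
    "digit": r"[0-9]",
    "digits": r"[0-9]+",
    "hex-digit": r"[0-9a-fA-F]",
    "hex-digits": r"[0-9a-fA-F]+",
}

def is_terminal(name: str, text: str) -> bool:
    """True when the whole text matches the named predefined terminal."""
    return re.fullmatch(PREDEFINED[name], text) is not None

print(is_terminal("digit", "20"), is_terminal("digits", "20"))  # False True
```

The singular names match exactly one character and the plural names match one or more, which is the only difference within each pair.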

If some parts of the Candle language are not implemented in this release, they are marked with gray color in the grammar.

3. Node Pattern and Schema

A schema is a collection of node pattern rules.

schema := "schema", qname, "{", (schema-rule)+, "}";
schema-rule := node-pattern;
node-pattern := (node-choice-pattern | node-concatenation-pattern | node-exclusion-pattern)*;
node-choice-pattern := node-repetition-pattern, "|", node-repetition-pattern;
node-concatenation-pattern := node-repetition-pattern, ",", node-repetition-pattern;
node-exclusion-pattern := node-repetition-pattern, "-", node-repetition-pattern;
node-repetition-pattern := node-pattern-term, ("*" | "?" | "+")?;
node-pattern-term := text-term | comment-term | data-term | attribute-term | element-term | node-rule-reference | node-pattern-group;
text-term := "text", ("{", string-pattern, "}")?;
comment-term := "comment", ("{", string-pattern, "}")?;
data-term := "data", ("{", sequence-pattern, "}")?;
attribute-term := "attribute", (qname, ("{", sequence-pattern, "}")?)?;
element-term := "element", (qname, ("{", node-pattern, "}")?)?;
node-rule-reference := qname;
node-pattern-group := "(", node-pattern, ")";

As in a text grammar, the root rule is always required; it defines the starting point of schema pattern matching.
The left side of a node pattern rule is its qualified name and the right side is its pattern production.
A node terminal can have an optional body, which matches the content of the node. Different node types have different content patterns: text and comment bodies are string patterns, data and attribute bodies are sequence patterns, and element bodies are node patterns.

An example schema is shown below:

<?csp1.0?>
namespace c = 'go:candle.style';
schema c:style-document {
    root = c:element-folio | c:element-window | c:element-scene;
    c:element-folio = element sty:folio { c:element-window | c:element-scene };
    c:element-window = element sty:window { c:block-content };
    c:element-scene = element sty:scene { c:block-content };
    c:block-content = (element sty:div { text? } | element sty:br)*;
}
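To show what the schema above accepts and rejects, here is a hand-written Python checker that mirrors its rules one function per rule. It rests on several assumptions that are not part of Candle: an XML tree read with `xml.etree.ElementTree` stands in for Candle's node hierarchy, element names are compared without their namespace prefix, and whitespace-only text between elements is ignored. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def blank(text) -> bool:
    """True for missing or whitespace-only text."""
    return text is None or not text.strip()

def is_block_content(parent) -> bool:
    """c:block-content = (element div { text? } | element br)*"""
    if not blank(parent.text):
        return False
    for child in parent:
        if child.tag == "div":
            ok = len(child) == 0                        # optional text, no child elements
        elif child.tag == "br":
            ok = len(child) == 0 and blank(child.text)  # no body at all
        else:
            ok = False
        if not ok or not blank(child.tail):
            return False
    return True

def is_window_or_scene(el) -> bool:
    """c:element-window and c:element-scene share the same body."""
    return el.tag in ("window", "scene") and is_block_content(el)

def is_folio(el) -> bool:
    """c:element-folio = element folio { window | scene }: exactly one child."""
    return el.tag == "folio" and len(el) == 1 and is_window_or_scene(el[0])

def is_style_document(el) -> bool:
    """root = folio | window | scene"""
    return is_folio(el) or is_window_or_scene(el)

good = ET.fromstring("<folio><window><div>hello</div><br/></window></folio>")
bad = ET.fromstring("<scene><p>x</p></scene>")
print(is_style_document(good), is_style_document(bad))  # True False
```

A pattern language replaces this kind of per-rule code with declarations, which is the point of the schema notation.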

4. Sequence Pattern

A sequence pattern is a pattern defined on a sequence of items.

seq-pattern := (seq-choice-pattern | seq-concatenation-pattern | seq-exclusion-pattern)*;
seq-choice-pattern := seq-repetition-pattern, "|", seq-repetition-pattern;
seq-concatenation-pattern := seq-repetition-pattern, ",", seq-repetition-pattern;
seq-exclusion-pattern := seq-repetition-pattern, "-", seq-repetition-pattern;
seq-repetition-pattern := seq-pattern-term, ("*" | "?" | "+")?;
seq-pattern-term := type-term | seq-rule-reference | seq-pattern-group;
type-term := "empty" | "error" | "boolean" | "byte" | "ubyte" | "short" | "int" | "uint" | "long" | "ulong"
    | "float" | "double" | "measure" | "datetime" | string-term | "binary" | "qname" | "id" | "uri"
    | "atomic" | "sequence" | text-term | comment-term | data-term | attribute-term | element-term;
string-term := "string", ("{", string-pattern, "}")?;
seq-rule-reference := qname;
seq-pattern-group := "(", seq-pattern, ")";

Sequence patterns are used to match the content of some nodes (such as data and attribute nodes), and also in match expressions.

At the moment, string is the only atomic type that can have an optional body, which matches the characters of the string.

<?csp1.0?>
namespace c='';
grammar c:sample-grammar {
    root = s?, (c:float | c:integer | c:uri), s?;
    c:integer = ("+" | "-")?, ("0" | ((digit - "0"), digit*));
    c:float = ("+" | "-")?, ( (digits, ".", digit* ) | (".", digits) );
    c:uri = "'", (char - "'")*, "'";
}
function main() {
    { "'uri'" match string { c:uri } } !! true
    { "uri" match string { c:uri } } !! false
    { "+3.57" match string { c:float } } !! true
    { ".57pt" match string { c:float } } !! false
}
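The three rules of the sample grammar above translate fairly directly into regular expressions, which makes it easy to check the four results by hand. The Python sketch below assumes that `match` tests the whole string against the pattern, which is consistent with ".57pt" being rejected; the constant names and the `matches` helper are hypothetical.

```python
import re

# c:integer = ("+" | "-")?, ("0" | ((digit - "0"), digit*));
INTEGER = r"[+-]?(?:0|[1-9][0-9]*)"
# c:float = ("+" | "-")?, ( (digits, ".", digit* ) | (".", digits) );
FLOAT = r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)"
# c:uri = "'", (char - "'")*, "'";
URI = r"'[^']*'"

def matches(pattern: str, text: str) -> bool:
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(URI, "'uri'"))    # True
print(matches(URI, "uri"))      # False
print(matches(FLOAT, "+3.57"))  # True
print(matches(FLOAT, ".57pt"))  # False
```

Note how `(digit - "0")` becomes the class `[1-9]` and `(char - "'")` becomes `[^']`: a single-character exclusion is simply a narrower character class.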

Appendices

A. References

XML Schema;
RELAX NG;
XQuery Sequence Type;
XML DTD;
EBNF;
Parsing Expression Grammar (PEG);
ANTLR;