Candle Pattern Reference
Version: Candle 0.10
Published date: Nov 12, 2011
1. Introduction
One unique feature of Candle is that it provides a unified pattern
language that can match 3 major types of data model. The 3 types of
patterns in Candle are:
- Sequence Pattern: a pattern on a sequence of items, similar to what is defined in XQuery Sequence Type.
- Node Pattern: a pattern on a node hierarchy, similar to what is defined in RELAX NG.
- String Pattern: a pattern on a string, similar to what is defined in RegEx and BNF.
What these 3 types of pattern have in common is that they all share the same context-free grammar. The grammar has only a few easy-to-understand constructs: rule reference, choice, concatenation, exclusion and repetition. When combined, however, they can define very complex patterns.
The 3 types of pattern differ only in the terminals of the grammar: the terminals in sequence patterns are items, the terminals in node patterns are nodes, and the terminals in string patterns are characters.
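The idea of one set of constructs with interchangeable terminals can be sketched outside Candle. The Python below is only an illustration, not Candle's implementation: the names `terminal`, `choice`, `concat`, `exclude`, `repeat` and `matches` are hypothetical helpers. Only `terminal` knows what kind of item it is looking at, so the same combinators work on a string (characters) and on a list (items).

```python
from typing import Callable, Iterator, Sequence

# A matcher takes (items, start) and yields every end position it can reach.
Matcher = Callable[[Sequence, int], Iterator[int]]

def terminal(pred) -> Matcher:
    """Match one item that satisfies pred; the only type-specific construct."""
    def m(items, i):
        if i < len(items) and pred(items[i]):
            yield i + 1
    return m

def choice(*ms) -> Matcher:
    """Match if any alternative matches."""
    def m(items, i):
        for sub in ms:
            yield from sub(items, i)
    return m

def concat(*ms) -> Matcher:
    """Match every sub-pattern one after the other."""
    def m(items, i):
        if not ms:
            yield i
            return
        for j in ms[0](items, i):
            yield from concat(*ms[1:])(items, j)
    return m

def exclude(a, b) -> Matcher:
    """Match what a matches, unless b matches the same span."""
    def m(items, i):
        for j in a(items, i):
            if j not in set(b(items, i)):
                yield j
    return m

def repeat(a, lo=0, hi=None) -> Matcher:
    """Match a between lo and hi times (hi=None means unbounded)."""
    def m(items, i, n=0):
        if n >= lo:
            yield i
        if hi is None or n < hi:
            for j in a(items, i):
                if j > i:  # guard against looping on empty matches
                    yield from m(items, j, n + 1)
    return m

def matches(pattern, items) -> bool:
    """True if the pattern can consume the whole input."""
    return any(end == len(items) for end in pattern(items, 0))

# String pattern: terminals are characters.
digit = terminal(lambda c: c in "0123456789")
sign = repeat(choice(terminal(lambda c: c == "+"), terminal(lambda c: c == "-")), hi=1)
integer = concat(sign, repeat(digit, lo=1))
non_zero = exclude(digit, terminal(lambda c: c == "0"))

# Sequence pattern: terminals are items of any type.
str_item = terminal(lambda x: isinstance(x, str))
int_item = terminal(lambda x: isinstance(x, int) and not isinstance(x, bool))
labelled_ints = concat(str_item, repeat(int_item))

print(matches(integer, "-42"))                      # True
print(matches(labelled_ints, ["total", 1, 2, 3]))   # True
```

The matchers enumerate all possible end positions, which keeps the sketch short at the cost of efficiency; a real engine would be expected to do something smarter.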
2. String Pattern and Grammar
Candle's grammar is self-hosted. Candle's grammar notation is closely based on EBNF.
There are two types of text grammar rules: lexical rules and syntax rules. Each rule starts with the qualified name of the rule, followed by the token '=' or ':=', then the production of the rule, and is terminated by the token ';'.
- Lexical rules: grammar rules where the production of the rule matches exactly against the source text. The token '=' separates the rule name from the rule production, e.g.:
  Name = ("_" | letter), ("_" | "-" | "." | letter | digit)*;
- Syntax rules: grammar rules where whitespace is implied between the terms of the production and need not be specified explicitly by the user. Syntax rules use the token ':=' to separate the rule name from the rule production, e.g.:
  Prologs := (NamespaceProlog | ImportProlog | ExternalRoutine | NativeRoutine)*,
             (Grammar | Schema | Structure | Class | Function | Template | Method | GlobalVarDeclaration)*;
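The difference between the two rule types can be approximated with regular expressions. The Python sketch below is an analogy, not Candle's behavior: `lexical`, `syntax` and `SPACE` are hypothetical names, and the implied space is assumed to be optional plain whitespace (a real grammar defines it through its own `space` rule, which may also cover comments).

```python
import re

# Stand-in for a grammar's `space` rule (assumption: optional whitespace only).
SPACE = r"\s*"

def lexical(*terms: str) -> re.Pattern:
    """Concatenate terms with nothing in between, like a '=' rule."""
    return re.compile("".join(terms))

def syntax(*terms: str) -> re.Pattern:
    """Concatenate terms with implied space between them, like a ':=' rule."""
    return re.compile(SPACE.join(terms))

# The Name rule from the lexical-rule example, rewritten as a regex.
NAME = r"[_A-Za-z][_\-.A-Za-z0-9]*"

lexical_rule = lexical(NAME, "=", NAME)
syntax_rule = syntax(NAME, "=", NAME)

print(bool(lexical_rule.fullmatch("a=b")))    # True
print(bool(lexical_rule.fullmatch("a = b")))  # False
print(bool(syntax_rule.fullmatch("a = b")))   # True
print(bool(syntax_rule.fullmatch("a=b")))     # True
```

The lexical form rejects `a = b` because nothing in its production accounts for the spaces, while the syntax form accepts both spellings.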
Below is the grammar of the Candle grammar notation itself:
<?csp1.0?>
!! the grammar rules of Candle grammar
grammar candle-grammar {
  root = grammar-root;
  space = (s | line-comment | block-comment)*;
  grammar-root := "grammar", qname, "{", (grammar-rule)+, "}";
  grammar-rule := lexical-rule | syntax-rule;
  lexical-rule := qname, "=", string-pattern, ";";
  syntax-rule := qname, ":=", string-pattern, ";";
  string-pattern := (choice-pattern | concatenation-pattern | exclusion-pattern)*;
  choice-pattern := repetition-pattern, "|", repetition-pattern;
  concatenation-pattern := repetition-pattern, ",", repetition-pattern;
  exclusion-pattern := repetition-pattern, "-", repetition-pattern;
  repetition-pattern := pattern-term, ("*" | "?" | "+")?;
  pattern-term := string | rule-reference | pattern-group;
  rule-reference := qname;
  pattern-group := "(", string-pattern, ")";
  !! rule productions, like string, line-comment, block-comment, are omitted
}
If you understand EBNF or RELAX NG, the grammar above should be self-explanatory. A few points are worth noting:
- Candle grammar has several reserved rule names: root is the top-level grammar rule where grammar matching starts; space is the grammar rule that Candle uses when matching syntax rules. The root rule is always required in a grammar. The space rule is required if there are syntax rules in the grammar.
- The grammar production is a pattern expression. There are 4 types of pattern expressions: the choice-pattern, the concatenation-pattern, the exclusion-pattern and the repetition-pattern.
- Pattern group: brackets can be used to group grammar terms together. The first 3 types of pattern are intentionally defined to have the same precedence, so that users are forced to use brackets '()' when different patterns are nested within each other. This should make the grammar more readable.
- choice-pattern defines a set of alternative terms as the pattern. The source only needs to match one of the terms in the choice pattern.
- concatenation-pattern (called a list pattern in RELAX NG) defines a series of terms as the pattern. The source has to match all the terms in order for the entire pattern to match.
- exclusion-pattern matches any source text that matches the repetition-pattern on the left side of the '-' token but does not match the repetition-pattern on the right side of the '-' token, e.g.:
  LineComment = "!!", (char - ("&cr;" | "&lf;"))*;
- repetition-pattern can be further divided into 3 sub-types. The three wildcard characters represent 3 types of repetition, which are widely used in RegEx:
    * : repeat 0 or more times;
    ? : repeat 0 or 1 time;
    + : repeat 1 or more times;
- At the leaf level of the grammar are the string terminals. There are 2 types of string terminals in Candle:
  - Literal String: a literal character or string wrapped in double quotes; XML text escape sequences are used to escape special characters;
  - Qname: a reference to another rule defined in the grammar or to a predefined string terminal in Candle. The predefined string terminals in Candle include:
    s           - one or more space, tab, carriage return and line feed characters
    char        - any single Unicode character
    letter      - one 'a-z' | 'A-Z' character
    letters     - one or more 'a-z' | 'A-Z' characters
    lower       - one 'a-z' character
    lowers      - one or more 'a-z' characters
    upper       - one 'A-Z' character
    uppers      - one or more 'A-Z' characters
    digit       - one '0-9' character
    digits      - one or more '0-9' characters
    hex-digit   - one '0-9' | 'a-f' | 'A-F' character
    hex-digits  - one or more '0-9' | 'a-f' | 'A-F' characters
    qname       - XML QName
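For readers more familiar with RegEx, the predefined terminals and the exclusion pattern can be approximated as follows. This Python sketch is a rough translation under stated assumptions, not how Candle works internally: `TERMINALS`, `exclusion` and `LINE_COMMENT` are hypothetical names, `qname` is left out because an XML QName does not reduce to a short character class, and the lookahead-based `exclusion` is only faithful when both operands match a single character.

```python
import re

# Regex stand-ins for the predefined string terminals listed above.
TERMINALS = {
    "s": r"[ \t\r\n]+",
    "char": r"[\s\S]",          # any single character, including line breaks
    "letter": r"[A-Za-z]",
    "letters": r"[A-Za-z]+",
    "lower": r"[a-z]",
    "lowers": r"[a-z]+",
    "upper": r"[A-Z]",
    "uppers": r"[A-Z]+",
    "digit": r"[0-9]",
    "digits": r"[0-9]+",
    "hex-digit": r"[0-9a-fA-F]",
    "hex-digits": r"[0-9a-fA-F]+",
}

def exclusion(left: str, right: str) -> str:
    """left - right, for single-character operands: reject `right`, then match `left`."""
    return rf"(?:(?!{right}){left})"

# LineComment = "!!", (char - ("&cr;" | "&lf;"))*;
LINE_COMMENT = re.compile("!!" + exclusion(TERMINALS["char"], r"[\r\n]") + "*")

print(bool(LINE_COMMENT.fullmatch("!! a comment")))    # True
print(bool(LINE_COMMENT.fullmatch("!! one\n!! two")))  # False
```

The second input fails because the line feed is excluded from the repeated `char`, so the comment cannot extend past the end of its line.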
Parts of the Candle language that are not implemented in this release are marked in gray in the grammar.
3. Node Pattern and Schema
A schema is a collection of node pattern rules.
schema := "schema", qname, "{", (schema-rule)+, "}";
schema-rule := node-pattern;
node-pattern := (node-choice-pattern | node-concatenation-pattern | node-exclusion-pattern)*;
node-choice-pattern := node-repetition-pattern, "|", node-repetition-pattern;
node-concatenation-pattern := node-repetition-pattern, ",", node-repetition-pattern;
node-exclusion-pattern := node-repetition-pattern, "-", node-repetition-pattern;
node-repetition-pattern := node-pattern-term, ("*" | "?" | "+")?;
node-pattern-term := text-term | comment-term | data-term | attribute-term | element-term | node-rule-reference | node-pattern-group;
text-term := "text", ("{", string-pattern, "}")?;
comment-term := "comment", ("{", string-pattern, "}")?;
data-term := "data", ("{", sequence-pattern, "}")?;
attribute-term := "attribute", (qname, ("{", sequence-pattern, "}")?)?;
element-term := "element", (qname, ("{", node-pattern, "}")?)?;
node-rule-reference := qname;
node-pattern-group := "(", node-pattern, ")";
- As in a text grammar, the root rule is always required; it defines the starting point of schema pattern matching.
- The left side of a node pattern rule is its qualified name and the right side is its pattern production.
- The node terminals can have an optional body, which matches the content of the node. Different node types have different content patterns.
An example schema is shown below:
<?csp1.0?>
namespace c = 'go:candle.style';
schema c:style-document {
  root = c:element-folio | c:element-window | c:element-scene;
  c:element-folio = element sty:folio { c:element-window | c:element-scene };
  c:element-window = element sty:window { c:block-content };
  c:element-scene = element sty:scene { c:block-content };
  c:block-content = (element sty:div { text? } | element sty:br)*;
}
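To make the schema's meaning concrete, here is a hand-written Python check that mirrors each rule using the standard `xml.etree.ElementTree` module. It is a sketch under several assumptions and not generated by or equivalent to Candle: the function names are hypothetical, namespace prefixes are dropped so that elements are compared by local name only, whitespace between child elements is ignored, and an element term without a body (`element sty:br`) is assumed to check the name only.

```python
import xml.etree.ElementTree as ET

def only_whitespace(s):
    """True for missing or whitespace-only text (assumed to be insignificant)."""
    return s is None or s.strip() == ""

def is_div(e):
    # element sty:div { text? } : optional text, no child elements
    return e.tag == "div" and len(e) == 0

def is_br(e):
    # element sty:br : no body pattern, so only the name is checked (assumption)
    return e.tag == "br"

def is_block_content(parent):
    # (div | br)* : every child is a div or a br, with no stray text around them
    if not only_whitespace(parent.text):
        return False
    for child in parent:
        if not (is_div(child) or is_br(child)):
            return False
        if not only_whitespace(child.tail):
            return False
    return True

def is_window(e):
    return e.tag == "window" and is_block_content(e)

def is_scene(e):
    return e.tag == "scene" and is_block_content(e)

def is_folio(e):
    # element sty:folio { window | scene } : exactly one window or one scene
    children = list(e)
    return (e.tag == "folio" and len(children) == 1
            and (is_window(children[0]) or is_scene(children[0])))

def is_style_document(e):
    # root = folio | window | scene
    return is_folio(e) or is_window(e) or is_scene(e)

doc = ET.fromstring("<folio><scene><div>hi</div><br/></scene></folio>")
print(is_style_document(doc))  # True
```

Each Python function corresponds to one schema rule, which is roughly the amount of code the declarative form saves.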
4. Sequence Pattern
A sequence pattern is a pattern defined on a sequence of items.
seq-pattern := (seq-choice-pattern | seq-concatenation-pattern | seq-exclusion-pattern)*;
seq-choice-pattern := seq-repetition-pattern, "|", seq-repetition-pattern;
seq-concatenation-pattern := seq-repetition-pattern, ",", seq-repetition-pattern;
seq-exclusion-pattern := seq-repetition-pattern, "-", seq-repetition-pattern;
seq-repetition-pattern := seq-pattern-term, ("*" | "?" | "+")?;
seq-pattern-term := type-term | seq-rule-reference | seq-pattern-group;
type-term := "empty" | "error" | "boolean" | "byte" | "ubyte" | "short" | "int" | "uint" | "long" | "ulong" | "float" | "double" | "measure" | "datetime" | string-term | "binary" | "qname" | "id" | "uri" | "atomic" | "sequence" | text-term | comment-term | data-term | attribute-term | element-term;
string-term := "string", ("{", string-pattern, "}")?;
seq-rule-reference := qname;
seq-pattern-group := "(", seq-pattern, ")";
Sequence patterns are used to match some node content (such as the content of data and attribute nodes), and also in the match expression. At the moment, string is the only atomic type that can have an optional body, which matches the characters of the string.
<?csp1.0?>
namespace c='';
grammar c:sample-grammar {
  root = s?, c:float | c:integer | c:uri, s?;
  c:integer = ("+" | "-")?, ("0" | ((digit - "0"), digit*));
  c:float = ("+" | "-")?, ((digits, ".", digit*) | (".", digits));
  c:uri = "'", (char - "'")*, "'";
}
function main() {
  { "'uri'" match string { c:uri } }  !! true
  { "uri" match string { c:uri } }  !! false
  { "+3.57" match string { c:float } }  !! true
  { ".57pt" match string { c:float } }  !! false
}
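The three rules of the sample grammar translate fairly directly into regular expressions, which gives a way to cross-check the four results in the comments. The Python below is a sketch for comparison only: `INTEGER`, `FLOAT`, `URI` and `match` are hypothetical names, and it assumes that `match string { ... }` must consume the whole string, which is consistent with the results shown.

```python
import re

# c:integer = ("+" | "-")?, ("0" | ((digit - "0"), digit*));
INTEGER = r"[+-]?(?:0|[1-9][0-9]*)"
# c:float = ("+" | "-")?, ((digits, ".", digit*) | (".", digits));
FLOAT = r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)"
# c:uri = "'", (char - "'")*, "'";
URI = r"'[^']*'"

def match(text: str, pattern: str) -> bool:
    """Whole-string match (assumed to mirror Candle's `match string { rule }`)."""
    return re.fullmatch(pattern, text) is not None

results = [
    match("'uri'", URI),
    match("uri", URI),
    match("+3.57", FLOAT),
    match(".57pt", FLOAT),
]
print(results)  # [True, False, True, False]
```

Note how `(digit - "0")` becomes the character class `[1-9]` and `(char - "'")` becomes the negated class `[^']`: for single characters, exclusion and negated classes express the same thing.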
Appendices
A. References