$ antlr4 Tags.g4
$ javac Tags*.java
$ grun Tags file -tokens XML-inputs/cat.xml
[@0,0:37='<?xml version="1.0" encoding="UTF-8"?>',<3>,1:0]
[@1,38:38='\n',<5>,1:38]
[@2,39:53='<?do not care?>',<3>,2:0]
[@3,54:54='\n',<5>,2:15]
[@4,55:63='<CATALOG>',<3>,3:0]
[@5,64:64='\n',<5>,3:9]
[@6,65:79='<PLANT id="45">',<3>,4:0]
[@7,80:85='Orchid',<5>,4:15]
[@8,86:93='</PLANT>',<3>,4:21]
[@9,94:94='\n',<5>,4:29]
[@10,95:104='</CATALOG>',<3>,5:0]
[@11,105:105='\n',<5>,5:10]
[@12,106:105='<EOF>',<-1>,6:11]
This baby XML grammar properly reads in XML files and matches a sequence
of the various islands and text. What it doesn’t do is pull apart the tags and
pass the pieces to a parser so it can check the syntax.
Issuing Context-Sensitive Tokens with Lexical Modes
The text inside and outside of tags conforms to different languages. For
example, id="45" is just a lump of text outside of a tag, but it’s three
tokens inside of a tag. In a sense, we want an XML lexer to match different sets of rules
depending on the context. ANTLR provides lexical modes that let lexers switch
between contexts (modes). In this section, we’ll learn to use lexical modes by
improving the baby XML grammar from the previous section so that it passes
tag components to the parser.
Lexical modes allow us to split a single lexer grammar into multiple sublexers.
The lexer can only return tokens matched by entering a rule in the current
mode. One of the most important requirements for mode switching is that
the language have clear lexical sentinels that can trigger switching back and
forth, such as left and right angle brackets. To be clear, modes rely on the
fact that the lexer doesn’t need syntactic context to distinguish between
different regions in the input.
To keep things simple, let’s build a grammar for an XML subset where tags
contain an identifier but no attributes. We’ll use the default mode to match
the sea outside of tags and another mode to match the inside of tags. When
the lexer matches < in default mode, it should switch to island mode (inside
tag mode) and return a tag start token to the parser. When the inside mode
sees >, it should switch back to default mode and return a tag stop token.
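The mode switching just described can be sketched as a small lexer grammar. This is a minimal sketch, not a definitive implementation: the grammar name ModeTagsLexer, the mode name ISLAND, and the token names OPEN, TEXT, CLOSE, SLASH, and ID are all assumptions chosen for illustration. It relies only on ANTLR 4's mode declaration and the mode(...) lexer command.

```antlr
lexer grammar ModeTagsLexer;   // hypothetical grammar name

// Default mode: the "sea" outside of tags
OPEN  : '<'   -> mode(ISLAND) ;        // tag start token; switch to inside-tag mode
TEXT  : ~'<'+ ;                        // everything up to the next '<' is one lump of text

// Island mode: rules below this line apply only inside a tag
mode ISLAND;
CLOSE : '>'   -> mode(DEFAULT_MODE) ;  // tag stop token; switch back to default mode
SLASH : '/' ;                          // lets closing tags such as </PLANT> tokenize
ID    : [a-zA-Z]+ ;                    // the tag's identifier (no attributes in this subset)
```

Assuming the sketch is saved as ModeTagsLexer.g4, it should be possible to try it the same way as the earlier grammar: generate and compile the lexer with antlr4 and javac, then run grun ModeTagsLexer tokens -tokens on an input file. Because the island mode has no rule for whitespace or attributes, input such as <PLANT id="45"> would produce token recognition errors, which is consistent with the deliberately small subset chosen here.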