My thinking on an advanced parser is to do it in several stages:
1. A basic tokeniser that will take wordlists of each type and turn them into a basic bytecode
2. A basic bytecode interpreter that will form the verb and noun phrases of each statement in the input, using the conjunctions to separate them
3. These are then passed to the separate commands for them to handle.
A worked example, using the phrase GET THE SWORD, FISH AND RED KEY. THEN KILL THE ORC QUICKLY WITH THE SWORD.
Step 1, we tokenise into bytecode:
Code:
GET = verb 01 = 01 01
THE = fluff, remove
SWORD = noun 04 = 02 04
, = conjunction = 03 01
FISH = noun 06 = 02 06
AND = conjunction = 03 03
RED = adjective = 04 03
KEY = noun = 02 08
. = conjunction = 03 02
THEN = conjunction = 03 04
KILL = verb = 01 04
THE = fluff, remove
ORC = noun = 02 01
QUICKLY = adverb = 05 01
WITH = dative = 06 01
THE = fluff, remove
SWORD = noun = 02 04
Leading to a byte code stream of:
Code:
01 01 02 04 03 01 02 06 03 03 04 03 02 08 03 02 03 04 01 04 02 01 05 01 06 01 02 04
The stream is in pairs of bytes: the first is the word type, the second identifies the specific word within that type.
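A rough sketch of the tokeniser stage, in Python rather than BBC BASIC. The type codes and word IDs are the ones from the example above; the table layout and the names `WORDLISTS`, `FLUFF` and `tokenise` are assumptions made up for illustration, and the trailing full stop of the example phrase is left off so the result matches the stream shown.

```python
import re

# Type codes and word IDs copied from the worked example.
VERB, NOUN, CONJ, ADJ, ADV, DATIVE = 1, 2, 3, 4, 5, 6

# One wordlist per type (hypothetical layout): word -> ID within that type.
WORDLISTS = {
    VERB: {"GET": 1, "KILL": 4},
    NOUN: {"ORC": 1, "SWORD": 4, "FISH": 6, "KEY": 8},
    CONJ: {",": 1, ".": 2, "AND": 3, "THEN": 4},
    ADJ: {"RED": 3},
    ADV: {"QUICKLY": 1},
    DATIVE: {"WITH": 1},
}
FLUFF = {"THE"}  # words that are simply thrown away


def tokenise(text):
    """Turn a line of input into a flat list of (type, id) byte pairs."""
    stream = []
    # Words are runs of letters; commas and full stops are tokens too.
    for word in re.findall(r"[A-Za-z]+|[,.]", text.upper()):
        if word in FLUFF:
            continue
        for type_code, words in WORDLISTS.items():
            if word in words:
                stream += [type_code, words[word]]
                break
        else:
            raise ValueError(f"unknown word: {word}")
    return stream


stream = tokenise(
    "GET THE SWORD, FISH AND RED KEY. THEN KILL THE ORC QUICKLY WITH THE SWORD"
)
print(" ".join(f"{byte:02d}" for byte in stream))
```

Unknown words raise an error here; a real game would more likely reply with something like "I don't know the word ...".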
Then we can use some simple logic to work out the noun and verb phrases, so working through the stream we have:
verb, noun, conjunction, noun, conjunction, adjective, noun, conjunction, conjunction, verb, noun, adverb, dative, noun
We can apply the rules:
- A conjunction followed by a conjunction is equal to one conjunction.
- A conjunction between nouns (allowing for any adjectives in front of a noun) groups nouns controlled by the same verb.
- A conjunction with a verb immediately before or after it is a new sentence
So we can reduce it to the following sentences:
verb, noun, noun, adjective, noun
verb, noun, adverb, dative, noun
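The three rules can be sketched in a few lines of Python. This is only one possible reading of them: the function name `split_sentences` is hypothetical, and "groups nouns" is taken to mean the conjunction can simply be dropped once it has been checked for a neighbouring verb.

```python
VERB, NOUN, CONJ, ADJ, ADV, DATIVE = 1, 2, 3, 4, 5, 6


def split_sentences(stream):
    """Split a flat byte stream into sentences of (type, id) pairs."""
    # Regroup the flat stream into (type, id) pairs.
    tokens = list(zip(stream[0::2], stream[1::2]))
    sentences, current = [], []
    i = 0
    while i < len(tokens):
        if tokens[i][0] != CONJ:
            current.append(tokens[i])
            i += 1
            continue
        # Rule 1: a run of conjunctions counts as one conjunction.
        j = i
        while j < len(tokens) and tokens[j][0] == CONJ:
            j += 1
        verb_before = i > 0 and tokens[i - 1][0] == VERB
        verb_after = j < len(tokens) and tokens[j][0] == VERB
        # Rule 3: a verb on either side means a new sentence starts here.
        if (verb_before or verb_after) and current:
            sentences.append(current)
            current = []
        # Rule 2: otherwise the conjunction only joins nouns, so drop it.
        i = j
    if current:
        sentences.append(current)
    return sentences


stream = [1, 1, 2, 4, 3, 1, 2, 6, 3, 3, 4, 3, 2, 8,
          3, 2, 3, 4, 1, 4, 2, 1, 5, 1, 6, 1, 2, 4]
for sentence in split_sentences(stream):
    print([type_code for type_code, _ in sentence])
```

For the example stream this gives two sentences whose type codes are 1 2 2 4 2 and 1 2 5 6 2, i.e. the two lines above.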
Then we can go through it again, splitting each sentence into verb phrases (VP), noun phrases (NP) and dative phrases (DP), so we have:
1st sentence
VP = GET
NP = SWORD, FISH, RED KEY
2nd sentence
VP = KILL QUICKLY
NP = ORC
DP = SWORD
Which can be directly passed to the command interpreter.
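A minimal sketch of that last split, again in Python. It assumes a sentence is a list of (type, id) pairs, that adverbs always belong to the verb, that adjectives attach to the next noun, and that every noun after a dative word goes into the DP. The name `split_phrases` and the dictionary it returns are made up for illustration.

```python
VERB, NOUN, CONJ, ADJ, ADV, DATIVE = 1, 2, 3, 4, 5, 6


def split_phrases(sentence):
    """Sort one sentence of (type, id) pairs into VP, NP and DP."""
    phrases = {"VP": [], "NP": [], "DP": []}
    target = "NP"  # nouns land here until a dative word is seen
    pending = []   # adjectives waiting for their noun
    for type_code, word_id in sentence:
        if type_code in (VERB, ADV):
            phrases["VP"].append((type_code, word_id))
        elif type_code == DATIVE:
            # The dative word itself is dropped, as in the example.
            target = "DP"
        elif type_code == ADJ:
            pending.append((type_code, word_id))
        elif type_code == NOUN:
            # Each noun is stored together with its adjectives.
            phrases[target].append(pending + [(type_code, word_id)])
            pending = []
    return phrases


# KILL ORC QUICKLY WITH SWORD, using the IDs from the example.
print(split_phrases([(1, 4), (2, 1), (5, 1), (6, 1), (2, 4)]))
```

Dropping the dative word is fine while WITH is the only one; with more than one, the command handler would probably want to know which was used.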
I've been trying to hack up some proof of concept code to show this is a lot easier than it sounds, but it's been 20 years since I've coded in BBC BASIC and it's turning out to be harder than I thought (too many good habits)!