
Writing a Monkey Language Interpreter in TypeScript
The original book is Writing an Interpreter in Go.
Adapted while following the JavaScript version by the Bilibili creator Snow-原青阳.

Preface

This installment implements the lexer and the REPL.

Lexer

1 Defining the tokens

These are just some constants and type definitions.
See token.ts.

export enum TokenType {
  ILLEGAL = 'ILLEGAL',
  EOF = 'EOF',

  // identifiers + literals
  IDENT = 'IDENT', // add, foobar, x, y, ...
  INT = 'INT', // 1343456

  // operators
  ASSIGN = '=',
  PLUS = '+',

  // delimiters
  COMMA = ',',
  SEMICOLON = ';',

  LPAREN = '(',
  RPAREN = ')',
  LBRACE = '{',
  RBRACE = '}',

  // keywords
  FUNCTION = 'FUNCTION',
  LET = 'LET',
}

export class Token {
  constructor(public type: TokenType, public literal: string) {}
  toString(): string {
    return `Token(${this.type}, ${this.literal})`;
  }
}
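
As a quick illustration of how these definitions are used (a standalone snippet for illustration only, not part of token.ts):

import { Token, TokenType } from './token';

const tok = new Token(TokenType.LET, 'let');
console.log(tok.toString()); // prints: Token(LET, let)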

2 A unit test function

First, write a simple test function to check whether the lexer correctly converts source code into tokens.
The structure is straightforward: define an input string and an array of expected answers, then iterate over them and verify each result.
See test/lexer.test.ts.

import { TokenType } from '../src/token';
import { Lexer } from '../src/lexer';

function TestNextToken() {
  const input = '=+(){},;';

  const tests: {
    expectedType: TokenType;
    expectedLiteral: string;
  }[] = [
    { expectedType: TokenType.ASSIGN, expectedLiteral: '=' },
    { expectedType: TokenType.PLUS, expectedLiteral: '+' },
    { expectedType: TokenType.LPAREN, expectedLiteral: '(' },
    { expectedType: TokenType.RPAREN, expectedLiteral: ')' },
    { expectedType: TokenType.LBRACE, expectedLiteral: '{' },
    { expectedType: TokenType.RBRACE, expectedLiteral: '}' },
    { expectedType: TokenType.COMMA, expectedLiteral: ',' },
    { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
    { expectedType: TokenType.EOF, expectedLiteral: '' },
  ];

  const l = new Lexer(input);

  tests.forEach((tt, i) => {
    const tok = l.nextToken();
    if (tok.type !== tt.expectedType) {
      console.error(`tests[${i}] - tokentype wrong. expected=${tt.expectedType}, got=${tok.type}`);
    }
    if (tok.literal !== tt.expectedLiteral) {
      console.error(`tests[${i}] - literal wrong. expected=${tt.expectedLiteral}, got=${tok.literal}`);
    }
  });

  console.log('All test cases completed!');
}

// Run the test
TestNextToken();

3 Lexing, part 1: the basic special symbols

First, implement a simple skeleton for the Lexer: read the input string and convert characters one by one into the enum values defined in token.ts.
This mainly sets up the basic machinery, such as the lexer state and a nextToken function; for now only the basic special symbols are recognized.
See lexer.ts.

import { Token, TokenType } from './token';

export class Lexer {
  input: string; // the input string
  position: number = 0; // current position in the input (points to the current character)
  readPosition: number = 0; // current reading position (points to the character after the current one)
  ch: string | undefined; // the character currently being examined

  constructor(input: string) {
    this.input = input;
    this.readChar();
  }

  readChar = () => {
    if (this.readPosition >= this.input.length) {
      this.ch = undefined;
    } else {
      this.ch = this.input[this.readPosition];
    }
    this.position = this.readPosition;
    this.readPosition++;
  };

  nextToken = (): Token => {
    let tok: Token;

    switch (this.ch) {
      case '=':
        tok = new Token(TokenType.ASSIGN, this.ch);
        break;
      case ';':
        tok = new Token(TokenType.SEMICOLON, this.ch);
        break;
      case '(':
        tok = new Token(TokenType.LPAREN, this.ch);
        break;
      case ')':
        tok = new Token(TokenType.RPAREN, this.ch);
        break;
      case ',':
        tok = new Token(TokenType.COMMA, this.ch);
        break;
      case '+':
        tok = new Token(TokenType.PLUS, this.ch);
        break;
      case '{':
        tok = new Token(TokenType.LBRACE, this.ch);
        break;
      case '}':
        tok = new Token(TokenType.RBRACE, this.ch);
        break;
      case undefined:
        tok = new Token(TokenType.EOF, '');
        break;
      default:
        tok = new Token(TokenType.ILLEGAL, <any>this.ch);
        break;
    }
    this.readChar();
    return tok;
  };
}

Then run npm run test. It prints 'All test cases completed!' with no error output before it, which means the simple Lexer is done.
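
For reference, the test script itself is never shown in this post; a minimal sketch of what the scripts entry might look like, assuming tsx is used as the runner here too (the same runner as the repl script added later), is:

"scripts": {
  "test": "npx tsx test/lexer.test.ts"
}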

4 Lexing, part 2: simple keywords, numbers, and whitespace

Now implement more complex token conversion, such as multi-character keywords, spaces, and line breaks.
Here we first handle the two keywords let and fn, numbers, and user-defined identifiers.

First, update the input and the expected-answer array in the test function.
See test/lexer.test.ts.

const input = `
let five = 5;
let ten = 10;

let add = fn(x, y) {
  x + y;
};
let result = add(five, ten);
`;

const tests: {
  expectedType: TokenType;
  expectedLiteral: string;
}[] = [
  { expectedType: TokenType.LET, expectedLiteral: 'let' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'five' },
  { expectedType: TokenType.ASSIGN, expectedLiteral: '=' },
  { expectedType: TokenType.INT, expectedLiteral: '5' },
  { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
  { expectedType: TokenType.LET, expectedLiteral: 'let' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'ten' },
  { expectedType: TokenType.ASSIGN, expectedLiteral: '=' },
  { expectedType: TokenType.INT, expectedLiteral: '10' },
  { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
  { expectedType: TokenType.LET, expectedLiteral: 'let' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'add' },
  { expectedType: TokenType.ASSIGN, expectedLiteral: '=' },
  { expectedType: TokenType.FUNCTION, expectedLiteral: 'fn' },
  { expectedType: TokenType.LPAREN, expectedLiteral: '(' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'x' },
  { expectedType: TokenType.COMMA, expectedLiteral: ',' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'y' },
  { expectedType: TokenType.RPAREN, expectedLiteral: ')' },
  { expectedType: TokenType.LBRACE, expectedLiteral: '{' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'x' },
  { expectedType: TokenType.PLUS, expectedLiteral: '+' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'y' },
  { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
  { expectedType: TokenType.RBRACE, expectedLiteral: '}' },
  { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
  { expectedType: TokenType.LET, expectedLiteral: 'let' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'result' },
  { expectedType: TokenType.ASSIGN, expectedLiteral: '=' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'add' },
  { expectedType: TokenType.LPAREN, expectedLiteral: '(' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'five' },
  { expectedType: TokenType.COMMA, expectedLiteral: ',' },
  { expectedType: TokenType.IDENT, expectedLiteral: 'ten' },
  { expectedType: TokenType.RPAREN, expectedLiteral: ')' },
  { expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },
  { expectedType: TokenType.EOF, expectedLiteral: '' },
];

At this point the tests certainly won't pass; the lexer needs further work.
First, extend token.ts by adding a mapping for the keywords.
See token.ts.

const keywordIdentMap = new Map<string, TokenType>([
  ['fn', TokenType.FUNCTION],
  ['let', TokenType.LET],
]);

export function lookupIdent(ident: string) {
  if (keywordIdentMap.has(ident)) {
    return keywordIdentMap.get(ident) as TokenType;
  }
  return TokenType.IDENT;
}
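
As a small illustration (not part of the source files), lookupIdent resolves reserved words to their keyword token types and falls back to IDENT for anything else:

import { lookupIdent } from './token';

console.log(lookupIdent('let'));    // 'LET'      -> TokenType.LET
console.log(lookupIdent('fn'));     // 'FUNCTION' -> TokenType.FUNCTION
console.log(lookupIdent('foobar')); // 'IDENT'    -> treated as a user-defined identifier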

Then extend the switch statement to recognize keywords, identifiers, and numbers,
and to skip whitespace characters.
See lexer.ts.

nextToken = (): Token => {
  let tok: Token;
  this.skipWhitespace();

  switch (this.ch) {
    /** The other switch cases are the same as before and are omitted. */
    default:
      if (this.isLetter(this.ch)) {
        const _literal = this.readIdentifier();
        const _type = lookupIdent(_literal);
        tok = new Token(_type, _literal);
        // Must return early here: readIdentifier has already advanced via readChar, so don't call it again.
        return tok;
      } else if (this.isDigit(this.ch)) {
        const _literal = this.readNumber();
        tok = new Token(TokenType.INT, _literal);
        // Same here: readNumber has already advanced past the number.
        return tok;
      } else {
        tok = new Token(TokenType.ILLEGAL, this.ch);
      }
      break;
  }
  this.readChar();
  return tok;
};

readIdentifier = () => {
  const start = this.position;
  while (this.ch && this.isLetter(this.ch)) {
    this.readChar();
  }
  return this.input.slice(start, this.position);
};

isLetter = (char: string) => /[a-zA-Z_]/.test(char);

skipWhitespace = () => {
  while (this.ch && /[ \t\n\r]/.test(this.ch)) {
    this.readChar();
  }
};

readNumber = () => {
  const start = this.position;
  while (this.ch && this.isDigit(this.ch)) {
    this.readChar();
  }
  return this.input.slice(start, this.position);
};

isDigit = (char: string) => char >= '0' && char <= '9';

readChar = () => {
  if (this.readPosition >= this.input.length) {
    this.ch = undefined;
  } else {
    this.ch = this.input[this.readPosition];
  }
  this.position = this.readPosition;
  this.readPosition++;
};

Then run npm run test.
We find that our simple test input

let five = 5;
let ten = 10;
let add = fn(x, y) {
  x + y;
};
let result = add(five, ten);

now passes: user-defined identifiers, numbers, and the let and fn keywords are all correctly recognized.

5 Lexing, part 3: the remaining keywords

Next come the remaining tokens. The two-character == is a bit more involved; the remaining keywords such as true, false, and return, plus operators like + - * /, work the same as before.

Add the following cases to the test.
See test/lexer.test.ts.

/** preceding part omitted */
! - / * 5;
5 < 10 > 5;

if (5 < 10) {
  return true;
} else {
  return false;
}

10 == 10;
10 != 9;

{ expectedType: TokenType.BANG, expectedLiteral: '!' },
{ expectedType: TokenType.MINUS, expectedLiteral: '-' },
{ expectedType: TokenType.SLASH, expectedLiteral: '/' },
{ expectedType: TokenType.ASTERISK, expectedLiteral: '*' },
{ expectedType: TokenType.INT, expectedLiteral: '5' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.INT, expectedLiteral: '5' },
{ expectedType: TokenType.LT, expectedLiteral: '<' },
{ expectedType: TokenType.INT, expectedLiteral: '10' },
{ expectedType: TokenType.GT, expectedLiteral: '>' },
{ expectedType: TokenType.INT, expectedLiteral: '5' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.IF, expectedLiteral: 'if' },
{ expectedType: TokenType.LPAREN, expectedLiteral: '(' },
{ expectedType: TokenType.INT, expectedLiteral: '5' },
{ expectedType: TokenType.LT, expectedLiteral: '<' },
{ expectedType: TokenType.INT, expectedLiteral: '10' },
{ expectedType: TokenType.RPAREN, expectedLiteral: ')' },
{ expectedType: TokenType.LBRACE, expectedLiteral: '{' },

{ expectedType: TokenType.RETURN, expectedLiteral: 'return' },
{ expectedType: TokenType.TRUE, expectedLiteral: 'true' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.RBRACE, expectedLiteral: '}' },
{ expectedType: TokenType.ELSE, expectedLiteral: 'else' },
{ expectedType: TokenType.LBRACE, expectedLiteral: '{' },

{ expectedType: TokenType.RETURN, expectedLiteral: 'return' },
{ expectedType: TokenType.FALSE, expectedLiteral: 'false' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.RBRACE, expectedLiteral: '}' },

{ expectedType: TokenType.INT, expectedLiteral: '10' },
{ expectedType: TokenType.EQ, expectedLiteral: '==' },
{ expectedType: TokenType.INT, expectedLiteral: '10' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.INT, expectedLiteral: '10' },
{ expectedType: TokenType.NOT_EQ, expectedLiteral: '!=' },
{ expectedType: TokenType.INT, expectedLiteral: '9' },
{ expectedType: TokenType.SEMICOLON, expectedLiteral: ';' },

{ expectedType: TokenType.EOF, expectedLiteral: '' },
/** rest omitted */

Add the mappings for these keywords in token.ts.

See token.ts.

MINUS = '-',
BANG = '!',
ASTERISK = '*',
SLASH = '/',

LT = '<',
GT = '>',

EQ = '==',
NOT_EQ = '!=',

TRUE = 'TRUE',
FALSE = 'FALSE',
IF = 'IF',
ELSE = 'ELSE',
RETURN = 'RETURN',
/** middle part omitted */
const keywordIdentMap = new Map<string, TokenType>([
  ['fn', TokenType.FUNCTION],
  ['let', TokenType.LET],
  ['true', TokenType.TRUE],
  ['false', TokenType.FALSE],
  ['if', TokenType.IF],
  ['else', TokenType.ELSE],
  ['return', TokenType.RETURN],
]);

Then simply add recognition for these in the switch in lexer.ts; it is basically the same as before.
The only difference is the two-character operators == and !=, which need a little extra handling: a peekChar function peeks at the next character.
See lexer.ts.

case '=':
  // the == case
  if (this.peekChar() === '=') {
    const _ch = this.ch;
    this.readChar();
    const _literal = _ch + this.ch;
    tok = new Token(TokenType.EQ, _literal);
    break;
  }
  tok = new Token(TokenType.ASSIGN, this.ch);
  break;

/** unchanged cases omitted */

case '-':
  tok = new Token(TokenType.MINUS, this.ch);
  break;
case '!':
  // the != case
  if (this.peekChar() === '=') {
    const _ch = this.ch;
    this.readChar();
    const _literal = _ch + this.ch;
    tok = new Token(TokenType.NOT_EQ, _literal);
    break;
  }
  tok = new Token(TokenType.BANG, this.ch);
  break;
case '*':
  tok = new Token(TokenType.ASTERISK, this.ch);
  break;
case '/':
  tok = new Token(TokenType.SLASH, this.ch);
  break;
case '<':
  tok = new Token(TokenType.LT, this.ch);
  break;
case '>':
  tok = new Token(TokenType.GT, this.ch);
  break;

/** rest omitted */

// Peek one character ahead, used for two-character tokens such as ==
peekChar = () => {
  if (this.readPosition >= this.input.length) {
    return;
  }
  return this.input[this.readPosition];
};

After these changes, run npm run test again; all tests pass without errors, and this part is complete.

REPL

1 What is a REPL

REPL stands for Read-Eval-Print Loop: as the name suggests, it reads a line of code, evaluates it, and prints the result.
From the user's point of view it is simply a console for interactive use: you type something, it runs and prints the output.
Implementation-wise, it is a loop built on Node's readline module: listen for the user's input, execute it when Enter is pressed, print the result, and start the next round.

2 Implementing the REPL

Although all we have so far is a lexer that turns source code into tokens, we can already build a simple REPL that just prints the tokens, to check whether the lexer works correctly.
Let's implement it.
See repl.ts.

import { createInterface } from 'readline';
import os from 'os';
import process from 'process';
import { Lexer } from './lexer';
import { TokenType } from './token';

const username = os.userInfo().username;
const PROMPT = '>> ';
const rl = createInterface({
  input: process.stdin,
  output: process.stdout,
  prompt: PROMPT,
});

function startRepl() {
  console.log(`Hello ${username}! This is the Monkey programming language!`);
  console.log('Feel free to type in commands');
  rl.prompt();

  rl.on('line', (line) => {
    const lexer = new Lexer(line);
    let token = lexer.nextToken();
    while (token.type !== TokenType.EOF) {
      process.stdout.write(`${JSON.stringify(token)}\n`);
      token = lexer.nextToken();
    }
    rl.prompt();
  });

  rl.on('close', () => {
    process.stdout.write('\nExiting REPL.\n');
  });
}

startRepl();

Then, for convenience, add a command to package.json:

"scripts": {
"repl": "npx tsx src/repl.ts"
}

Run npm run repl to open the REPL.
Try typing some Monkey code,
for example let a = 1; fn abc(a, b) { return a + b; };
and see whether the tokens are printed correctly.

The output is correct, as shown below:

npm run repl

> monkey.ts@1.0.0 repl
> npx tsx src/repl.ts

Hello WiSHao! This is the Monkey programming language!
Feel free to type in commands
>> let a = 1; fn abc(a, b) { return a + b; };
{"type":"LET","literal":"let"}
{"type":"IDENT","literal":"a"}
{"type":"=","literal":"="}
{"type":"INT","literal":"1"}
{"type":";","literal":";"}
{"type":"FUNCTION","literal":"fn"}
{"type":"IDENT","literal":"abc"}
{"type":"(","literal":"("}
{"type":"IDENT","literal":"a"}
{"type":",","literal":","}
{"type":"IDENT","literal":"b"}
{"type":")","literal":")"}
{"type":"{","literal":"{"}
{"type":"RETURN","literal":"return"}
{"type":"IDENT","literal":"a"}
{"type":"+","literal":"+"}
{"type":"IDENT","literal":"b"}
{"type":";","literal":";"}
{"type":"}","literal":"}"}
{"type":";","literal":";"}
>>
Exiting REPL.

That completes the REPL.

Conclusion

This installment completed the lexer and the REPL. Although only some simple functionality is in place, the outline of the Monkey language is already visible.
Next up is the parser.
