minidb-01-lexer
This post documents the first step of my “write a relational database from scratch” journey: the lexer. The lexer turns raw SQL text into a stream of tokens the parser can consume.
Why a Lexer?
Before parsing SQL syntax (SELECT…FROM…WHERE…), we need to normalize raw text into tokens:
- classify words like `SELECT` (keyword) vs. `price` (identifier),
- detect literals like `'Alice'` or `12.34`,
- handle punctuation like `(`, `)`, `,`, `;`, `*`, `+`, `-`, `/`.
A clean token stream makes the parser much simpler and safer.
Token Model
Token kinds
public enum TokenKind {
KEYWORD,
IDENTITY,
STRING,
NUMBER,
OPEN_PAREN, CLOSE_PAREN, COMMA, SEMICOLON,
ASTERISK, PLUS, MINUS, SLASH
}
- KEYWORD: `SELECT`, `FROM`, `WHERE`, …
- IDENTITY: table / column names (starts with a letter; letters, digits, `_` thereafter)
- STRING: single-quoted literal, e.g., `'hello'`
- NUMBER: integer or decimal, e.g., `123`, `45.67`
- Symbols: `(` `)` `,` `;` `*` `+` `-` `/`
Keywords
A fixed set mapped case-insensitively:
public enum Keyword {
Create, Table,
Insert, Into,
Select, From,
Where, Null, Default,
And, Or, Not,
True, False, Boolean, Bool,
Float, Double, Integer, Int, String, Text, Varchar,
Values, Primary, Key;
}
Implementation detail:
- A static, unmodifiable `LOOKUP` map stores upper-cased strings → `Keyword`. `Keyword.fromString(String)` returns `null` if the token is not a keyword (in which case it's an identifier).
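An aside on the `LOOKUP` map: since its keys are just the upper-cased constant names, the map could also be derived from the enum values themselves instead of hand-written `put(...)` lines. A sketch with a trimmed keyword set (`Kw` is a stand-in name, not the minidb `Keyword`):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch (not the minidb code): derive the lookup map from the enum
// constants, so adding a keyword never needs a matching put(...) line.
enum Kw {
    Create, Table, Select, From, Where;

    private static final Map<String, Kw> LOOKUP = Arrays.stream(values())
            .collect(Collectors.toUnmodifiableMap(
                    k -> k.name().toUpperCase(), Function.identity()));

    static Kw fromString(String s) {
        return LOOKUP.get(s.toUpperCase()); // null → treat as identifier
    }
}

public class KeywordLookupSketch {
    public static void main(String[] args) {
        System.out.println(Kw.fromString("select")); // Select
        System.out.println(Kw.fromString("price"));  // null
    }
}
```

The trade-off is purely maintenance: one stream pipeline versus twenty-odd explicit `put` calls that must be kept in sync with the enum.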
Token object
public static class Token {
private TokenKind kind;
private String text; // IDENT/STRING/NUMBER/symbol literal
private Keyword keyword; // for KEYWORD only
// factory methods: keyword(...), identity(...), string(...), number(...), symbol(...)
}
- `equals`/`hashCode` implemented → stable in tests & collections.
- `toString()` helpful for debugging (e.g., `KEYWORD(Select)`, `NUMBER(12.3)`).
Lexing Rules
Whitespace
- Skipped: space, `\t`, `\n`, `\r`.
Strings
- Format: `'...'`
- Captures everything until the next single quote.
- No escape support yet (`'It\'s'` is not handled).
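Standard SQL escapes a quote inside a string by doubling it (`'It''s'`). A possible future variant of `scanString` that handles this — a sketch only, not the current minidb code, with index bookkeeping simplified to a local scan:

```java
public class StringScanSketch {
    // Hypothetical scanString variant: a doubled quote ('') inside the
    // literal becomes a single quote in the token text, as in standard SQL.
    static String scanQuoted(char[] s, int start) {
        StringBuilder out = new StringBuilder();
        int i = start + 1;                          // skip the opening '
        while (i < s.length) {
            char c = s[i++];
            if (c == '\'') {
                if (i < s.length && s[i] == '\'') { // '' → literal quote
                    out.append('\'');
                    i++;
                } else {
                    return out.toString();          // closing quote
                }
            } else {
                out.append(c);
            }
        }
        throw new IllegalStateException("[Lexer] Unexpected end of string");
    }

    public static void main(String[] args) {
        System.out.println(scanQuoted("'It''s'".toCharArray(), 0)); // It's
    }
}
```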
Numbers
- Integers and fixed-point decimals: `123` or `12.34`.
- No sign handling here (`-` is a separate `MINUS` token).
- No scientific notation (e.g., `1e5`) yet.
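One consequence of the current rules: `scanNumber` always consumes a dot after the integer part, so `12.` lexes as `NUMBER(12.)`. A stricter sketch (hypothetical, not the current code) takes the dot only when a digit follows:

```java
public class NumberScanSketch {
    // Hypothetical stricter variant: consume '.' only if a digit follows,
    // so a trailing dot is left for the next token instead of being
    // swallowed into the number.
    static String scanNumber(char[] s, int i) {
        int start = i;
        while (i < s.length && Character.isDigit(s[i])) i++;
        if (i + 1 < s.length && s[i] == '.' && Character.isDigit(s[i + 1])) {
            i++; // consume '.'
            while (i < s.length && Character.isDigit(s[i])) i++;
        }
        return new String(s, start, i - start);
    }

    public static void main(String[] args) {
        System.out.println(scanNumber("12.34,".toCharArray(), 0)); // 12.34
        System.out.println(scanNumber("12.x".toCharArray(), 0));   // 12
    }
}
```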
Identifiers & Keywords
- Identifiers: start with a letter; then letters/digits/underscore.
- The sliced lexeme is upper-cased and checked in `Keyword.LOOKUP`.
  - Hit → `KEYWORD`.
  - Miss → `IDENTITY`.
Symbols
- Supported: `(` `)` `,` `;` `*` `+` `-` `/`
End of Input
- `hasNext()` caches the next token; sets `finished` when none remain.
Errors
- Unexpected character → `ParseException("[Lexer] Unexpected character: ...")`
- Unterminated string → `ParseException("[Lexer] Unexpected end of string")`
Implementation Highlights
- Single pass over `char[]` with index `i`.
- Peek + Next API: `hasNext()` scans & caches one token; `peek()` returns the cached token without consuming; `nextToken()` consumes and returns.
- Test helper: `tokenize()` collects all tokens into a list, perfect for assertions.
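The scan-and-cache shape in the bullets above isn't SQL-specific; it works for any pull-based token source. A generic sketch (not part of minidb) where a null-returning supplier marks end of input, mirroring `scanToken()`:

```java
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Generic form of the lexer's caching pattern:
// hasNext() fills the one-slot cache, peek() reads it, next() empties it.
public class Peekable<T> {
    private final Supplier<T> source; // returns null at end of input
    private T cached;
    private boolean finished;

    public Peekable(Supplier<T> source) { this.source = source; }

    public boolean hasNext() {
        if (finished) return false;
        if (cached != null) return true;
        cached = source.get();
        if (cached == null) { finished = true; return false; }
        return true;
    }

    public T peek() { return hasNext() ? cached : null; }

    public T next() {
        if (!hasNext()) throw new NoSuchElementException("No more tokens");
        T t = cached;
        cached = null;
        return t;
    }

    public static void main(String[] args) {
        java.util.Iterator<String> it = java.util.List.of("a", "b").iterator();
        Peekable<String> p = new Peekable<>(() -> it.hasNext() ? it.next() : null);
        System.out.println(p.peek());    // a
        System.out.println(p.next());    // a
        System.out.println(p.next());    // b
        System.out.println(p.hasNext()); // false
    }
}
```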
Key private helpers:
private void skipWhiteSpace()
private boolean eof()
private boolean isWhiteSpace(char c)
private boolean isDigit(char c)
private boolean isLetter(char c)
private boolean isSymbol(char c)
Example
Input SQL
SELECT name, age FROM users WHERE age > 18;
Output Tokens (conceptual)
KEYWORD(Select)
IDENTITY(name)
COMMA(,)
IDENTITY(age)
KEYWORD(From)
IDENTITY(users)
KEYWORD(Where)
IDENTITY(age)
[`>` not yet supported; see the note below]
NUMBER(18)
SEMICOLON(;)
Note: Comparison operators like `>` / `>=` aren't implemented yet.
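Supporting them would take a one-character lookahead in the symbol scanner, so that `>=` wins over `>`. A sketch of the core decision, using hypothetical `GT`/`GE`/`LT`/`LE` names that are not in today's `TokenKind`:

```java
public class OperatorScanSketch {
    // Hypothetical lookahead for comparison operators. GT/GE/LT/LE are
    // illustrative token names, not part of the current TokenKind enum.
    static String scanComparison(char[] s, int i) {
        boolean hasEq = i + 1 < s.length && s[i + 1] == '=';
        switch (s[i]) {
            case '>': return hasEq ? "GE" : "GT";
            case '<': return hasEq ? "LE" : "LT";
            default:  throw new IllegalArgumentException(
                          "[Lexer] Not a comparison: '" + s[i] + "'");
        }
    }

    public static void main(String[] args) {
        System.out.println(scanComparison(">= 18".toCharArray(), 0)); // GE
        System.out.println(scanComparison("> 18".toCharArray(), 0));  // GT
    }
}
```

In the real lexer this would also need to advance `i` by one or two positions depending on which token matched.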
Quick Usage
Lexer lex = new Lexer("INSERT INTO t (a, b) VALUES (1, 'x');");
while (lex.hasNext()) {
System.out.println(lex.nextToken());
}
Sample output:
KEYWORD(Insert)
KEYWORD(Into)
IDENTITY(t)
OPEN_PAREN(())
IDENTITY(a)
COMMA(,)
IDENTITY(b)
CLOSE_PAREN())
KEYWORD(Values)
OPEN_PAREN(())
NUMBER(1)
COMMA(,)
STRING(x)
CLOSE_PAREN())
SEMICOLON(;)
Design Choices
- Factory methods for `Token` creation: fewer mistakes, clearer intent.
- Peekable iterator: parser can look ahead without consuming.
- Minimal symbol set: faster to get a working parser; add more when needed.
Appendix: Full Source
Lexer.java
package com.ziyingdeng.minidb.parser;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;
public class Lexer {
public enum Keyword {
Create, Table,
Insert, Into,
Select, From,
Where, Null, Default,
And, Or, Not,
True, False, Boolean, Bool,
Float, Double, Integer, Int, String, Text, Varchar,
Values, Primary, Key;
private static final Map<String, Keyword> LOOKUP;
static {
Map<String, Keyword> m = new HashMap<>();
m.put("CREATE", Create); m.put("TABLE", Table);
m.put("INSERT", Insert); m.put("INTO", Into);
m.put("SELECT", Select); m.put("FROM", From);
m.put("WHERE", Where); m.put("AND", And);
m.put("OR", Or); m.put("NOT", Not);
m.put("NULL", Null); m.put("DEFAULT", Default);
m.put("TRUE", True); m.put("FALSE", False);
m.put("BOOLEAN", Boolean); m.put("BOOL", Bool);
m.put("FLOAT", Float); m.put("DOUBLE", Double);
m.put("INTEGER", Integer); m.put("INT", Int);
m.put("STRING", String); m.put("TEXT", Text);
m.put("VARCHAR", Varchar); m.put("VALUES", Values);
m.put("PRIMARY", Primary); m.put("KEY", Key);
LOOKUP = Collections.unmodifiableMap(m);
}
// Convert String to Keyword
public static Keyword fromString(String str) {
return LOOKUP.get(str.toUpperCase());
}
}
// Five kinds of tokens:
// 1. Keyword: Select, From, Create...
// 2. Identity: column name, table name...
// 3. String: string literal
// 4. Number: integer or decimal
// 5. Symbol: ";", "*" ...
public enum TokenKind {
KEYWORD,
IDENTITY,
STRING,
NUMBER,
OPEN_PAREN, CLOSE_PAREN, COMMA, SEMICOLON,
ASTERISK, PLUS, MINUS, SLASH
}
public static class Token {
private TokenKind kind;
private String text; // IDENT/STRING/NUMBER/symbol
private Keyword keyword; // KEYWORD
private Token(TokenKind tk, String t, Keyword kw) { this.kind = tk; this.text = t; this.keyword = kw; }
public static Token keyword(Keyword kw) { return new Token(TokenKind.KEYWORD, kw.name(), kw); }
public static Token identity(String s) { return new Token(TokenKind.IDENTITY, s, null); }
public static Token string(String s) { return new Token(TokenKind.STRING, s, null); }
public static Token number(String s) { return new Token(TokenKind.NUMBER, s, null); }
public static Token symbol(TokenKind k, String literal) { return new Token(k, literal, null); }
public boolean equals(Object o) {
if (this == o) return true;
if (!(o instanceof Token t)) return false;
return kind == t.kind
&& Objects.equals(text, t.text)
&& keyword == t.keyword;
}
public int hashCode() {
return Objects.hash(kind, text, keyword);
}
public String toString() {
return kind == TokenKind.KEYWORD ? "KEYWORD(" + keyword + ")" :
kind + "(" + text + ")";
}
}
// Lexer code
private char[] s; // consume sql text char by char
private int i; // record position
private int n; // length of string
private boolean finished; // if we reach the end of input
private Token cachedToken; // saved next token for peek()
public Lexer(String sqlText) {
this.s = sqlText.toCharArray();
this.i = 0;
this.n = sqlText.length();
}
// Scan and check if there is the next token
// -- if true, scan one token and store in cache, advance pointer
// -- if false, set finished = true
public boolean hasNext() {
if (finished) return false;
if (cachedToken != null) return true;
cachedToken = scanToken();
if (cachedToken == null) {
finished = true;
return false;
}
return true;
}
// Consume and return the token in cached
public Token nextToken() {
if (!hasNext()) throw new NoSuchElementException("No more tokens");
Token t = cachedToken;
cachedToken = null; // clean
return t;
}
// Peek the next token but doesn't consume
public Token peek() {
return hasNext() ? cachedToken : null;
}
// For test: collect all the tokens
List<Token> tokenize() {
List<Token> out = new ArrayList<>();
while (hasNext()) {
out.add(nextToken());
}
return out;
}
// Scan and return one token from the char array
private Token scanToken() {
// skip all white space before each token
skipWhiteSpace();
if (eof()) return null;
char c = s[i];
if (c == '\'') return scanString();
if (isDigit(c)) return scanNumber();
if (isLetter(c)) return scanIdentityKeyword();
if (isSymbol(c)) return scanSymbol(c);
throw new ParseException("[Lexer] Unexpected character: '" + c + "'");
}
// String format: '...'
private Token scanString() {
i++; // skip the beginning '\''
int start = i;
while (!eof()) {
char c = s[i++];
if (c == '\'') {
return Token.string(new String(s, start, (i - 1) - start)); // skip the ending '\''
}
}
throw new ParseException("[Lexer] Unexpected end of string");
}
// Number format: 1234 and 12.34
private Token scanNumber() {
int start = i;
while (!eof() && isDigit(s[i])) { i++; }
if (!eof() && s[i] == '.') { // handle float
i++;
while (!eof() && isDigit(s[i])) { i++; }
}
return Token.number(new String(s, start, i - start));
}
// Identity format: must start with letter, followed by letter/number/'_'
// Keyword format: LOOK UP table
private Token scanIdentityKeyword() {
int start = i;
i++; // first letter checked
while (!eof()) {
char c = s[i];
if (isLetter(c) || isDigit(c) || c == '_') { i++; }
else break;
}
String t = new String(s, start, i - start);
Keyword kw = Keyword.fromString(t.toUpperCase());
return (kw == null) ? Token.identity(t) : Token.keyword(kw);
}
// Supported symbols: ( ) , ; * + - /
private Token scanSymbol(char c) {
i++; // consume
return switch (c) {
case '(' -> Token.symbol(TokenKind.OPEN_PAREN, "(");
case ')' -> Token.symbol(TokenKind.CLOSE_PAREN, ")");
case ',' -> Token.symbol(TokenKind.COMMA, ",");
case ';' -> Token.symbol(TokenKind.SEMICOLON, ";");
case '*' -> Token.symbol(TokenKind.ASTERISK, "*");
case '+' -> Token.symbol(TokenKind.PLUS, "+");
case '-' -> Token.symbol(TokenKind.MINUS, "-");
case '/' -> Token.symbol(TokenKind.SLASH, "/");
default -> throw new ParseException("[Lexer] Unknown symbol: '" + c + "'");
};
}
private void skipWhiteSpace() { while (!eof() && isWhiteSpace(s[i])) { i++; }}
private boolean eof() { return i >= n; }
private boolean isWhiteSpace(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }
private boolean isDigit(char c) { return c >= '0' && c <= '9'; }
private boolean isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
private boolean isSymbol(char c) {
return c == '(' || c == ')' || c == ',' || c == ';' || c == '*' || c == '+' || c == '-' || c == '/';
}
}