About

This is a set of lexical analyzers for tokenizing source code. Libraries are currently available for processing JavaScript, Python, CSS, and XML/HTML, each with source code in both JavaScript and Python 2/3.

It was written primarily to address some edge-case JavaScript parsing issues found in several major applications (Notepad++, Firefox, Sublime Text, GitHub/Ace). These cases usually involve regular expressions or sign-prefixed numbers.

Downloads

Files named lex.* are the base classes; files named lexlang.* (where lang is the language being tokenized) are the language descriptor generation files.
Example: lex.js and lexpy.js are the files needed to process Python code on a JavaScript interpreter.

lex.js – JavaScript source; runs on both Node.js and browsers
lex.py – Python source; versions 2.x and 3.x should both work
lexjs.js – JavaScript lexical analyzer (runs on JavaScript)
lexjs.py – JavaScript lexical analyzer (runs on Python)
lexpy.js – Python lexical analyzer (runs on JavaScript)
lexpy.py – Python lexical analyzer (runs on Python)
lexcss.js – CSS lexical analyzer (runs on JavaScript)
lexcss.py – CSS lexical analyzer (runs on Python)
lexxml.js – XML/HTML lexical analyzer (runs on JavaScript)
lexxml.py – XML/HTML lexical analyzer (runs on Python)

Usage

The general format for using these libraries is:

Import your language's lex.* file
Import your language's descriptor generator lexlang.* file
Create the descriptor using lexlang = lexlang.gen(lex);
Create a lex.Lexer object with the descriptor as the first argument, and the input string as the second
In Python, the input should preferably be a unicode string
Repeatedly call the lexer's get_token method until it returns null (or the language's equivalent, such as None in Python).
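
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not code from the library: the helper name tokenize is hypothetical, and it assumes the Python port signals the end of input by returning None. The base module and the descriptor module are passed in as arguments so the same helper works with any of the lexlang.py files.

```python
def tokenize(lex, lexlang, source):
    """Collect every token the lexer produces for `source` into a list.

    `lex` is the imported base module (lex.py) and `lexlang` is an imported
    descriptor generator module (for example lexjs.py). `tokenize` itself is
    a hypothetical helper and not part of the library.
    """
    descriptor = lexlang.gen(lex)           # create the language descriptor
    lexer = lex.Lexer(descriptor, source)   # descriptor first, input second
    tokens = []
    while True:
        token = lexer.get_token()
        if token is None:                   # assumed end-of-input marker
            break
        tokens.append(token)
    return tokens
```

With the real modules on the import path, this would presumably be called as tokenize(lex, lexjs, source) after importing lex and lexjs.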

Otherwise, get_token returns a Token object with four fields:

text – the token string
type – the type constant of the token
flags – flags for the token
Many flags are primarily used internally; some are useful outside the Lexer
state – the state the token was generated in
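
As an illustration of reading these fields, here is a small Python sketch that summarizes a list of tokens already collected from get_token. The helper names count_by_type and join_text are hypothetical; the code relies only on the text and type fields described above.

```python
from collections import Counter


def count_by_type(tokens):
    """Count how many tokens carry each `type` value."""
    return Counter(token.type for token in tokens)


def join_text(tokens):
    """Concatenate the `text` field of each token, in order."""
    return "".join(token.text for token in tokens)
```

Whether join_text reproduces the original input exactly depends on the lexer emitting a token for every character of the source, which is an assumption here rather than something this document states.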

JavaScript token type constants:

INVALID, KEYWORD, LITERAL, IDENTIFIER, NUMBER, STRING, REGEX, OPERATOR, WHITESPACE, COMMENT

Python token type constants:

INVALID, KEYWORD, LITERAL, IDENTIFIER, NUMBER, STRING, OPERATOR, WHITESPACE, COMMENT

CSS token type constants:

INVALID, WHITESPACE, COMMENT, STRING, WORD, OPERATOR, AT_RULE, SEL_TAG, SEL_CLASS, SEL_ID, SEL_PSEUDO_CLASS, SEL_PSEUDO_ELEMENT, SEL_N_EXPRESSION, NUMBER, COLOR

XML/HTML token type constants:

COMMENT, CDATA, TEXT, RAW_DATA, TAG_OPEN, TAG_CLOSE, TAG_NAME, ATTRIBUTE, ATTRIBUTE_WHITESPACE, ATTRIBUTE_OPERATOR, ATTRIBUTE_STRING

Generic token flags (that can be useful outside the `Lexer`):

flags.MEMBER, // indicates the word is a member (identifier_word.member_word)
flags.BRACKET, // this operator is a bracket of some sort
flags.BRACKET_CLOSE, // this operator is a closing bracket
... // Additional token flag constants can be found by opening the library's source
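
One possible use of the bracket flags is tracking nesting depth. The Python sketch below assumes that a token's flags field is an integer bit field that can be tested with bitwise AND; this document does not confirm that, so check the library source first. The helper name max_bracket_depth is hypothetical, and the two flag constants are passed in as parameters because where they live in the Python port is not stated here.

```python
def max_bracket_depth(tokens, bracket_flag, bracket_close_flag):
    """Return the deepest bracket nesting level reached in `tokens`.

    Assumes `token.flags` is an integer bit field (not confirmed by the
    documentation). The closing flag is tested first, so the result is the
    same whether or not closing brackets also carry the generic bracket flag.
    """
    depth = 0
    deepest = 0
    for token in tokens:
        if token.flags & bracket_close_flag:
            depth -= 1                       # closing bracket
        elif token.flags & bracket_flag:
            depth += 1                       # opening bracket
            deepest = max(deepest, depth)
    return deepest
```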

For additional help, see the test files; examples are often more useful than wordy documentation.