Lexical Analysis Libraries
Simple lexical analysis libraries for JavaScript and Python

About

This is a set of lexical analyzers for tokenizing source code. Currently there are libraries for processing JavaScript, Python, CSS, and XML/HTML, with source code in JavaScript and Python 2/3.

It was primarily written to address some edge-case JavaScript parsing issues found in several major applications (Notepad++, Firefox, Sublime Text, GitHub/Ace). These cases usually involve regular expressions or sign-prefixed numbers.

Downloads

Files named lex.* are the base classes; files named lexlang.* are the language descriptor generation files, where lang stands for the language being processed.
Example: lex.js and lexpy.js are the files needed to process Python code on a JavaScript interpreter.
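
The naming rule above can be sketched as a small helper. This is only an illustration: required_files is a hypothetical function that is not part of the libraries, and the only language code confirmed by the example above is "py".

```python
def required_files(lang_code, host_ext):
    """Return the two file names needed to process one language.

    lang_code: code of the language being processed ("py" in the example above).
    host_ext:  file extension of the interpreter running the lexer ("js" or "py").
    Hypothetical helper for illustration; not part of the libraries.
    """
    base = "lex.{0}".format(host_ext)                       # base classes
    descriptor = "lex{0}.{1}".format(lang_code, host_ext)   # descriptor generator
    return [base, descriptor]


# Python code processed on a JavaScript interpreter, as in the example above.
print(required_files("py", "js"))
```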

Usage

The general format for using these libraries is:

  1. Import your language's lex.* file
  2. Import your language's descriptor generator lexlang.* file
  3. Create the descriptor using lexlang = lexlang.gen(lex);
  4. Create a lex.Lexer object with the descriptor as the first argument, and the input string as the second
    In Python, the source code should preferably be a unicode string
  5. Repeatedly call the lexer's get_token method until it returns null (or the language's equivalent, such as None in Python).
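
The loop in step 5 can be sketched in Python as follows. This is a minimal sketch, not the library's own code: iter_tokens is a hypothetical helper, and StubLexer is a stand-in used only so the snippet runs without the library installed. The module name lexjs in the comments is an assumption that follows the lexlang.* naming pattern.

```python
def iter_tokens(lexer):
    """Yield tokens from any object with a get_token() method until it returns None."""
    while True:
        token = lexer.get_token()
        if token is None:  # end of input (step 5)
            break
        yield token


class StubLexer(object):
    """Hypothetical stand-in for lex.Lexer so this sketch is self-contained.

    It takes (descriptor, source) like step 4 describes, ignores the
    descriptor, and simply splits the source on whitespace.
    """

    def __init__(self, descriptor, source):
        self._pending = source.split()

    def get_token(self):
        return self._pending.pop(0) if self._pending else None


# With the real library, steps 1-4 would look roughly like this
# (module name lexjs is assumed from the naming pattern):
#   import lex, lexjs
#   descriptor = lexjs.gen(lex)
#   lexer = lex.Lexer(descriptor, u"var x = -1;")
lexer = StubLexer(None, u"var x = -1;")
tokens = list(iter_tokens(lexer))
print(tokens)
```

Wrapping the loop in a generator keeps the end-of-input check in one place, so callers can use an ordinary for loop over the tokens.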

When it does not return a null value, get_token returns a Token object with 4 fields.

JavaScript token type constants:
INVALID, KEYWORD, LITERAL, IDENTIFIER, NUMBER, STRING, REGEX, OPERATOR, WHITESPACE, COMMENT

Python token type constants:
INVALID, KEYWORD, LITERAL, IDENTIFIER, NUMBER, STRING, OPERATOR, WHITESPACE, COMMENT

CSS token type constants:
INVALID, WHITESPACE, COMMENT, STRING, WORD, OPERATOR, AT_RULE, SEL_TAG, SEL_CLASS, SEL_ID, SEL_PSEUDO_CLASS, SEL_PSEUDO_ELEMENT, SEL_N_EXPRESSION, NUMBER, COLOR

XML/HTML token type constants:
COMMENT, CDATA, TEXT, RAW_DATA, TAG_OPEN, TAG_CLOSE, TAG_NAME, ATTRIBUTE, ATTRIBUTE_WHITESPACE, ATTRIBUTE_OPERATOR, ATTRIBUTE_STRING

Generic token flags (that can be useful outside the Lexer):
flags.MEMBER, // indicates the word is a member (identifier_word.member_word)
flags.BRACKET, // this operator is a bracket of some sort
flags.BRACKET_CLOSE, // this operator is a closing bracket
... // Additional token flag constants can be found by opening the library's source

For additional help, view some of these test files, as examples are often more useful than wordy documentation.

Demo

A few reference images are included for test1.js for comparison of highlighting: