A Javascript lexer in Python, and the saga behind it

Sunday 10 April 2011

In the last week I’ve written a new Javascript lexer, jslex. Why I did it is one of those open source adventures that starts innocently enough.

I’m working on a Django project for a client, and it needs to be localized into their language. Django has good support for localization, providing tools for extracting strings from Python, HTML, and Javascript files. But something wasn’t right: the client reported that some of the strings were still in English. Usually this means that they made a small mistake during the translation process, and the English in the source doesn’t match the English in the message file.

But when I looked, it turned out the English was completely missing from the message file. Check the source: yup, it’s properly marked for translation. Then I remembered: parsing Javascript source files for messages is fragile. I’d encountered this before, and had simply fiddled with the Javascript source to make the problem go away. But this time, as one message was re-harvested, other messages would disappear. The problem seemed more severe than I had encountered in the past. I decided to learn more about why it was happening.

Like many open source projects, Django uses Gnu gettext to manage the message files, including using the xgettext tool to parse the source files to find strings to translate. But xgettext doesn’t support parsing Javascript. Django has a strange accommodation to deal with this: it performs a simple transformation on the Javascript source, then tells xgettext that it’s Perl.

I can only guess why Perl was chosen: because Javascript and Perl both have regex literals, which as we’ll see, play a large part in this story. But Django’s Javascript-to-Perl transformation is simplistic: it just converts all //-comments on their own line into #-comments. So this Javascript:

// My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

gets transformed into this “Perl”:

# My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

I assume the reason //-comments that share a line with code are skipped is to avoid clobbering strings with // in them, though with multi-line strings, even that is not enough to protect them.
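To make the transformation concrete, here is a minimal Python sketch of the rule described above. It is not Django's actual code: the function name `js_comments_to_perl` and the regex are my own assumptions, written only to reproduce the sample shown earlier.

```python
import re

# Hypothetical sketch, not Django's implementation: match a //-comment that
# is the only thing on its line (optionally indented with spaces or tabs).
_OWN_LINE_COMMENT = re.compile(r"^([ \t]*)//(.*)$", re.MULTILINE)


def js_comments_to_perl(source):
    """Turn //-comments that sit alone on a line into #-comments."""
    return _OWN_LINE_COMMENT.sub(r"\1#\2", source)


js = """// My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");
"""

# The first line becomes a #-comment; the trailing comment on the
# second line is left untouched, apostrophe and all.
print(js_comments_to_perl(js))
```

The comment that shares a line with code survives unchanged, which is exactly the kind of leftover that causes trouble once the result is handed to a Perl parser.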

Of course, this transformation is insufficient to properly carry the strings into the “Perl” so that xgettext can find them. For example, in the above sample, the Javascript comment on line 2 is still executable Perl code after the transformation, and the apostrophe in the comment is considered the start of a string literal, so the gettext call is skipped as part of a multi-line string.

In fact, depending on the version of gettext, which determines how advanced its Perl parsing is, all sorts of innocuous Javascript constructs can throw off the parser:

gettext("Message on 1");
var x = y;
gettext("Message on 3");
gettext("Message on 4");
gettext("Message on 5");

Here messages 1 and 5 are found, and 3 and 4 are not. How come? Because Perl’s y (transliteration) operator consumes two strings delimited by the next character, in this case a semicolon. Everything up to the third semicolon is swallowed, so lines 3 and 4 are considered literals rather than code.

In truth, Django’s accommodation for Javascript is an egregious hack. So I wanted to find a better solution. I figured that if I could properly lex Javascript, then I could manipulate the token stream to create something that could reliably be parsed by gettext.

The result is jslex, a pure-Python lexer for Javascript. Lexing Javascript turns out to be tricky due to our old friend the regex literal. When a slash character is found, it could mean one of four things: a division operator (either / or /=), a line comment (//), a multi-line comment (/*), or a regex literal. The two comment forms are simple to deal with, because a regex literal can’t be empty, so // is always a comment, and a regex can’t start with a star, so /* is always a comment.

But distinguishing between division and regexes is impossible to do at a purely lexical level, and can be quite subtle:

for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ?  z/x:3;x<5;y</g/i) {xyz(x++);}

The first line has a regex of /x:3;x<5;y</g, the second has /g/i.

The ECMAScript standard says you need to parse the code, and if you’re at a point where a regex literal would be a valid next token, then lex it as a regex, but if you’re at a point where a division would be valid, then lex it as division.

I wasn’t willing to write a full parser, but I’ve taken a similar approach to other light Javascript tools, and use the previous token to decide if the next token can be division or regex. It seems to work well.

The lexer is a general-purpose multi-state lexer built on regular expressions. The rules create a two-state lexer with a state for “division possible,” and “regex possible.” When I thought I had it working, I outsourced the QA to Stack Overflow, finally finding something to do with my too-many reputation points: pay a bounty to find Javascript it doesn’t lex properly. Mind-twistingly, a respondent there found a useful test: a Javascript lexer written in Javascript, which when fed through my lexer, failed because my regex-matching regex couldn’t properly lex his regex-matching regex!
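As an illustration of the two-state idea, here is a toy Python sketch. It is not jslex's code or API: the rule tables, the `lex` function, the state names, and the keyword list are all assumptions of mine, and the token patterns are far from complete (multi-character operators, for instance, come out as single characters).

```python
import re

# Token rules shared by both states. Comments come first so that // and /*
# are never mistaken for a regex or a division.
COMMON = [
    ("ws", r"\s+"),
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("string", r'"(?:[^"\\\n]|\\.)*"' r"|'(?:[^'\\\n]|\\.)*'"),
    ("number", r"\d+(?:\.\d+)?"),
    ("name", r"[A-Za-z_$][\w$]*"),
]
PUNCT = r"[{}()\[\];,.<>=!+\-*%&|^~?:]"
# A regex literal: a non-empty body (with escapes and [...] classes), then flags.
REGEX = r"/(?:[^/\\\n\[]|\\.|\[(?:[^\]\\\n]|\\.)*\])+/[A-Za-z]*"


def _compile(rules):
    return re.compile("|".join("(?P<%s>%s)" % rule for rule in rules), re.DOTALL)


LEXERS = {
    "reg": _compile(COMMON + [("regex", REGEX), ("punct", PUNCT)]),
    "div": _compile(COMMON + [("punct", r"/=?|" + PUNCT)]),
}

# Names after which an operand (and so a regex) is expected. Illustrative only.
KEYWORDS = {"return", "typeof", "in", "of", "case", "delete", "void",
            "throw", "new", "do", "else", "instanceof"}


def lex(source):
    """Return (kind, text) pairs, skipping whitespace and comments."""
    state, pos, tokens = "reg", 0, []
    while pos < len(source):
        m = LEXERS[state].match(source, pos)
        if m is None:
            raise ValueError("cannot lex at offset %d" % pos)
        kind, text, pos = m.lastgroup, m.group(), m.end()
        if kind in ("ws", "comment"):
            continue  # these never change the state
        tokens.append((kind, text))
        # The previous significant token decides what a following slash means.
        ends_operand = (
            kind in ("number", "string", "regex")
            or text in (")", "]")
            or (kind == "name" and text not in KEYWORDS)
        )
        state = "div" if ends_operand else "reg"
    return tokens


line1 = 'for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}'
line2 = 'for (var x = a in foo && "</x>" || mot ?  z/x:3;x<5;y</g/i) {xyz(x++);}'
for line in (line1, line2):
    print([text for kind, text in lex(line) if kind == "regex"])
```

On the two sample lines above, this sketch picks out `/x:3;x<5;y</g` and `/g/i` respectively, because the slash after `:` follows punctuation while the slash after `z` follows a name. A closing brace is treated as "regex possible" here, which is one of the places where a previous-token heuristic can guess wrong.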

To bridge Javascript code to xgettext, I chose to transform it into “C” instead of Perl. That means getting rid of the regex literals by turning them all into the C string “REGEX”, and changing single-quoted strings into double-quoted strings.
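A rough Python sketch of that rewriting step follows. It assumes a token stream of `(kind, text)` pairs with kinds named `"regex"` and `"string"`; that shape, and the helper names `requote` and `to_c_like`, are hypothetical and chosen for illustration, not taken from jslex or the Django patch.

```python
import re


def requote(text):
    """Rewrite a single-quoted JS string literal as a double-quoted one."""
    pieces = []
    # Walk the body one character at a time, keeping escape pairs together.
    for m in re.finditer(r"\\.|.", text[1:-1], re.DOTALL):
        piece = m.group()
        if piece == "\\'":
            piece = "'"        # \' no longer needs its backslash
        elif piece == '"':
            piece = '\\"'      # a bare " now has to be escaped
        pieces.append(piece)
    return '"' + "".join(pieces) + '"'


def to_c_like(tokens):
    """Join (kind, text) tokens into text a C-oriented extractor can read."""
    out = []
    for kind, text in tokens:
        if kind == "regex":
            out.append('"REGEX"')          # regex literals become a dummy string
        elif kind == "string" and text.startswith("'"):
            out.append(requote(text))
        else:
            out.append(text)               # everything else passes through
    return "".join(out)


tokens = [
    ("name", "x"), ("ws", " "), ("punct", "="), ("ws", " "),
    ("regex", "/a'b/g"), ("punct", ";"), ("ws", " "),
    ("name", "gettext"), ("punct", "("), ("string", "'hi'"),
    ("punct", ")"), ("punct", ";"),
]
print(to_c_like(tokens))
```

Replacing each regex with a fixed string keeps stray quotes and slashes inside the regex from confusing the downstream parser, while the translatable strings themselves pass through intact.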

The next phase is to determine whether this gets into Django or not. I’ve prepared it as a patch, but there was already some momentum to replace gettext with Babel, and it’s looking like it might all have to wait for 1.4 in any case. As someone who’s recently lost time to this bug, I would really rather get something into 1.3.1, so we’ll see where that ends up.

In any case, if you need to lex Javascript in Python, use jslex. It works.

Comments

[gravatar]
You can already use Babel in place of xgettext to extract messages and manage the message catalogs. Babel includes extractors for Python and Javascript code, and an extractor for Django templates is included in the BabelDjango extension.

You just need to install Babel and BabelDjango, create a small config file to tell it how to extract messages from which kinds of files (see Extraction Method Mapping and Configuration) and then run the corresponding pybabel commands instead of Django's makemessages (and optionally compilemessages).
[gravatar]
I could use it for me html rendering engine ;)
[gravatar]
Thanks for JsLex, Ned - I used it to write a simple tool for stripping console.* calls out of JavaScript source.
https://github.com/davidqhogan/nocons
[gravatar]
I started a new project involving Python and Javascript, thought you might be interested
[gravatar]
Are you trying to suggest that my letting him know I found a great use for his library is somehow inappropriate? If so, I don't understand why.
[gravatar]
@David, I think abki's comment can look sarcastic, but is not. He's linked to a bitbucket repo containing a project of his.
[gravatar]
@David, Ned is right, I'm not sarcastic at all, but my comment is poorly edited, here is the link to the project http://goo.gl/gT1FD
[gravatar]
@abki my apologies! It didn't make sense to me but the timing and my not realising that the name linked to the project conspired to confuse me into thinking I was being told off :P

Python in the browser .. a long held dream of mine. Awesome project.
[gravatar]
This thread was bumped in my message box because of spam. Here is an update on the subject of Python in the browser:

- I tried to create a rendering engine; typesetting and layout resolution are hell. This could be made easier with a constraint resolution engine using the Cassowary algorithm.

- I tried to create a Python -> Javascript translator several times, now called Pythonium. It kind of proves that it is possible for Python (Clojure has full browser support). My own benchmarks tell me that it doesn't prove useful to fully translate Python (integer / float, __getattribute__, __getattr__, metaclass (replaced by class decorators)); in other words, the version that is the most compliant is very slow. Actually 30 times slower. One might say "Python doesn't care". I'm not Python, I care.

Also, maybe a full overhaul of the web browser tech wouldn't be more interesting for a Python experience.
