My adventures with Python, functional programming, Korean, Test Driven Development and more

2010 Stanley Cup predictions

2010-04-12, by Dan Bravender <dan.bravender@gmail.com>

 1   SJ                             WSH  1
         SJ                     WSH
 8  COL                             MTL  8
            DET             WSH
 4  PHX                             PIT  4
        DET                     PIT
 5  DET                             OTT  5
                DET     WSH
 2  CHI                             NJ   2
        CHI         DET         NJ
 7  NSH                             PHI  7
            CHI             BUF
 3  VAN                             BUF  3
        VAN                     BUF
 6  LAK                             BOS  6

Some Assembly Required

2010-04-01, by Dan Bravender <dan.bravender@gmail.com>

Tired of slow web frameworks?

Does your website need to scale? Are you sick of adding new servers to handle the load of your extremely popular website? Do you have a frozen feature set and no desire to maintain your code? If so, Some Assembly Required is the scaling solution you've been looking for.

Upgrading from a scripting language
  1. Manually convert your program from the slow dynamic language it is currently written in to an nginx assembly module.
  2. Reconfigure nginx to serve the module.
  3. Sit back and watch as your iron-level optimizations improve the speed of your site by several orders of magnitude (or less).
  4. Fight all requests for new features and bug fixing with this simple phrase: "Are you kidding me? It's written in assembly! It will take months to add that new feature or track down that bug."
  5. Relax.
Example

Here's a hello world example.

The example was based on code from George Malamidis.

Productive PHP

2010-03-29, by Dan Bravender <dan.bravender@gmail.com>

I have personally said some pretty harsh things about PHP over the years. It's hard to say nice things about a language which doesn't have namespaces (apparently they are going to be introduced in PHP 6). This has led to a standard library that eats up the global namespace. On my box at home there are over 1,345 functions defined globally when I run get_defined_functions(). The number of defined functions depends on which modules you have installed. That said, I have started using PHP again and I'm determined to make it a more pleasant experience this time. I have discovered several tools that make working in PHP less painful and I'd like to share them.

One of the things that really bugs me about PHP is that it doesn't tell me what led up to an error. In Python you get a traceback every time an exception is raised. The good news is that if you install the xdebug module you can get nice tracebacks of PHP errors. Problem solved.

When developing I need to be able to use a REPL (Read Eval Print Loop) so I can make sure my assumptions about how the language works are correct. PHP doesn't come with a REPL but fortunately the guys at Facebook made phpsh, which is a great tool. It says a lot that the people at Facebook needed a tool like this since they have to put up with PHP every day. After installing it you can load any PHP files and then run arbitrary PHP commands and see the output. It also pulls up documentation from a database and lets you reload external PHP files. I cannot understand how people develop in PHP without it.

Last but not least is testing. This is where some of PHP's design flaws start to show up. I am a proponent of Test Driven Development which is very hard to do in PHP. Why? Because in PHP you cannot redefine global functions. This means that any call to mysql_query is going to be a call to whoever defines mysql_query first. When refactoring code that doesn't use OOP it can be very hard to ensure the code is doing what you expect except through manual testing. I banged my head against this over the weekend and I came up with GlobalMock, a way to make it so PHP can do late binding of global calls. It does require that you slightly modify your code. Here is a simple piece of code before:

<?
mysql_connect('localhost', 'root', '');
mysql_select_db('test');

function get_uid($username) {
    $query = sprintf("select uid from users where username='%s' limit 0,1",
                     mysql_real_escape_string($username));
    $result = mysql_query($query);
    if (!$result) {
        return false;
    }
    list($uid) = mysql_fetch_array($result);
    return $uid;
}

And here is what it looks like after applying GlobalMock so runtime binding is possible:

<?
require_once('../global_mock.php');
$gm = GlobalMock::getInstance();

$gm->mysql_connect('localhost', 'root', '');
$gm->mysql_select_db('test');

function get_uid($username) {
    $gm = GlobalMock::getInstance();
    $query = sprintf("select uid from users where username='%s' limit 0,1",
                     $gm->mysql_real_escape_string($username));
    $result = $gm->mysql_query($query);
    if (!$result) {
        return false;
    }
    list($uid) = $gm->mysql_fetch_array($result);
    return $uid;
}

Now actual unit tests can be written:

<?
require_once('../test/simpletest/autorun.php');
require_once('../global_mock.php');

class TestUser extends UnitTestCase {
    function testInitialization() {
        $gm = GlobalMock::getInstance();
        $gm->testing();
        $gm->add_expected('mysql_connect',
                          new GlobalMockIgnore(),
                          true);
        $gm->add_expected('mysql_select_db',
                          array('test'),
                          true);
        require_once('user_testable.php');
    }

    function testGetUid() {
        $gm = GlobalMock::getInstance();
        $gm->testing();
        $gm->add_expected('mysql_real_escape_string',
                          array('angelo_luis_martin'),
                          'angelo_luis_martin');
        $gm->add_expected('mysql_query',
                          array("select uid from users where username='angelo_luis_martin' limit 0,1"),
                          'results_of_search');
        $gm->add_expected('mysql_fetch_array',
                          array('results_of_search'),
                          array('1'));
        $this->assertEqual(get_uid('angelo_luis_martin'), '1');
    }
}

If $gm->testing() isn't called the arguments will be passed to the globally defined function. I hope this helps make PHP easier to test and therefore less painful to use. If you want to give GlobalMock a spin the source is up for grabs on GitHub.

WebSocket Games in Python with Tornado

2010-01-15, by Dan Bravender <dan.bravender@gmail.com>

I've implemented my favorite board and card games in many different technologies over the years. Several years ago I tried using a cutting-edge technology at the time named Comet, which is supposed to be a pun on AJAX. Since I chose Comet I got bogged down in testing on different browsers and wrote hack upon hack to make everything work. Comet is really just a hack to make bidirectional communication between browsers and servers work. This is tough to do since HTTP is a stateless protocol. When I recently read about WebSockets I knew the time had come to see if I could clean up my projects and actually focus on the game logic instead of worrying about how messages were being passed back and forth.

WebSockets won't be available in browsers until the next generation, but there is a wonderful Flash-based bridge written by Hiroshi Ichikawa called web-socket-js that lets you use WebSockets in almost any browser that supports Flash. It's a great project since everything just works out of the box.

The web server that I chose to implement the games in is Tornado. Tornado is a high performance web server written in Python. I went with Tornado because one of its authors, Bret Taylor, recently added support for WebSockets. WebSockets are a subclass of RequestHandler so they can be mounted at any path and are accessible at the same port as your Tornado server. This should make it easier to run WebSockets in production.

Creating a WebSocket object on the client is as simple as can be:

ws = new Websocket('ws://localhost:9999/ws');

Once you have a WebSocket object, you can add callbacks for when you receive messages and you can send messages:

ws.send('hello server'); // Sends "hello server" to the server

ws.onmessage = function(msg) {
    // Alerts any message received on the WebSocket
    alert(msg.data);
}

It's probably best to wrap all of this inside of an onopen block in case the WebSocket doesn't open it's connection immediately. This would cause the send command to fail.

ws.onopen = function() {
    ws.send('hello server');

    ws.onmessage = function(msg) {
        alert(msg.data);
    };
};

In my projects, I tried to keep the protocol as simple as possible. I think I succeeded in my mission. In the templates, player moves are defined like so:

onclick="ws.send('{{ player.remember(player.play_card, card) }}');"

player.remember returns the id of a callback that is stored on the server. In this example the callback is equivalent to calling player.play_card(card). Anything that follows the callback id on the wire is passed to the callback as the keyword argument 'message'. The callback that allows players to rename themselves is constructed like this:

<input id="name" value="{{ player.name }}" onchange="ws.send('{{ player.remember(player.rename) }} ' + $('#name').val())"/>

This greatly reduces the need to perform complex serialization and deserialization. I could've used JSON on both ends but even then I'd still have to look up items in a similar way to get to the actual live Python objects on the server. Using callbacks means adding a new action doesn't require too many modifications on either end.

Here's how the callback mechanism works:

def remember(self, callback, *args, **kwargs):
    def doit(message):
        kwargs['message'] = message
        callback(*args, **kwargs)
    cid = str(id(doit))
    self.callbacks[cid] = doit
    return cid

When the server receives a message it asks the player to decode the message. The player object does this by looking up the reference in its callback dictionary. Once the player decodes the message it tells the game object that the player wants to perform an action. The game workflow is tremendously simple as well. The current state is stored as a string. For example, in Kaibosh, if we are in the bidding phase, game.state is 'bid'. I chose to use strings instead of references to methods because I found that Python would never garbage collect an object that referred to its own methods. I created a simple decorator which is declared above each state method. The decorator ensures that the player is playing on their turn and that the parameter is of the appropriate type. This is how play_card is defined in the kaibosh module:

@message(Card)
def play_card(self, player, card):
    ...

I tried to keep the client as stupid as possible so I could keep state in one place: the server. I had originally planed on making it so that there was one supernode client who controlled the games. Then I realized that I'm a Python developer and I really don't want to mess around too much with Javascript. Plus, it would be really hard to deter cheating if all the information was stored in one client. But, I digress. Every time the client receives a message from the server it pulls in the table and player's hand with an XMLHttpRequest call. This helps keep the logic all in one place. jQuery is another technology that I can't live without anymore. Loading chunks of HTML is ridiculously simple with jQuery:

$('#hand').load('http://' + location.hostname + ':' + location.port + location.pathname + '/hand');

It's possible that for some games you'll need to queue up actions and store some state on the client. I did this for Set, since three cards have to be selected for the client to send a message to the server. I tried to keep the state in one place and one place only. When a card is clicked this toggles whether or not it has the "selected" class. When three cards have this class they are sent to the server. When someone else discovers a set while you have cards selected the play area will remove some cards. Since some of the cards that are removed might have been selected, every time a new table is loaded the cards that were saved in the selected_cards array have the "selected" class applied to them. jQuery makes this easier than it should be:

var selected_cards = [];

function check_set() {
    selected_cards = [];
    $('.selected').each(function() {
        selected_cards.push($(this)[0].id);
    });

    if (selected_cards.length == 3) {
        ws.send(selected_cards.join(' '));
    }
}

function check_table() {
    for (i in selected_cards) {
        if (document.getElementById(selected_cards[i])) {
            $('#' + selected_cards[i]).addClass('selected');
        }
    }
}

function update_table() {
    $('#table').load(
        'http://' + location.hostname + ':' + location.port + location.pathname + '/table', 
        function (responseText, textStatus, XMLHttpRequest) {
            $('.card').bind('click', function() {
                if ($('.selected').length < 3 || $(this).hasClass('selected')) {
                    $(this).toggleClass('selected');
                    check_set();
                }
            });
            check_table();
        }
    );
}

WebSockets have made creating games that run in the browser a much easier task than it was before. I'm sure that people will find many new uses for WebSockets. It's great that we can start using it today through web-socket-js and Tornado.

Here is a picture of WebSocket Kaibosh (a Euchre variant which my family plays quite often):

The rules for Kaibosh are here.

Here is a picture of WebSocket Set:

You can play Set here.

The source code for the games I've implemented can be found at GitHub:

Metalytics - The Analytics of Analytics

2009-12-21, by Dan Bravender <dan.bravender@gmail.com>

Which analytics services are most widely used? I wondered this a couple years ago and whipped up a script to check. The script checks 1000 random URLs and looks through the HTML for some strings that are known to be used by certain analytics engines.

Google Analytics is Winning

After the last run it looks like Google Analytics is well ahead of the other analytics packages. The script checks for Clicky Web Analytics, Coremetrics, HitsLink, Webtrends, Omniture, Piwik, StatCounter, W3Counter and Site Meter. Here is a graph from the last run:

Originally, when I wrote the script I was only checking for Google Analytics so I have some data points from as far back as November, 2007. Here's a graph of the percentage of websites that use Google Analytics:

Dongsa.net Irregular verb logic update

2009-12-05, by Dan Bravender <dan.bravender@gmail.com>

In the past 6 months dongsa.net has been used by 3412 unique visitors from 59 countries. As of today it has conjugated 8394 verbs. It's great to see so many people all over the world learning Korean.

Last night I updated the way that irregulars are handled and tried to make it more obvious when a verb is regular or irregular. If you ask for the conjugation of 쉽다 the site will now tell you that you are looking at a ㅂ irregular at the top of the page. The rare verbs that have both regular and irregular forms, such as 걷다 now display a message about this at the top of the page.

The whole irregular detection system was gutted and replaced with a much simpler system. The previous system used heuristics to determine if a verb was regular or irregular. There were a lot of rules, they weren't always right, and they were getting very hard to maintain. The solution was to extract verbs that look irregular but aren't from Bryan Park's spreadsheet of Korean verbs based on his excellent book 500 Korean Verbs. A list of exceptions was created of verbs that met the criteria of an irregular form but conjugated regularly.

For example, 웃다 ends in a ㅅ, so it might be an irregular ㅅ verb. Both the regular and irregular conjugations are tested to see which one is correct. It turns out that the regular conjugation is correct, so 웃다 is added to the list of verbs that looks like an irregular but is indeed regular.

This commit has a list of all of the verbs that appeared to meet the criteria for being irregular but are regular.

If you are interested in dongsa.net the entry announcing the site might be of interest. Also, the code is available on github for your perusal.

MapReduce for Go

2009-11-24, by Dan Bravender <dan.bravender@gmail.com>

Google's new language, Go, isn't as revolutionary as its designers would have you believe but it is an interesting language nonetheless. It is a systems language that has features that address the multicore trend. The future is parallel computation and who better to introduce a systems language with first-class concurrency primitives than the company that popularized MapReduce? Speaking of MapReduce, here's a simple implementation I wrote for Go (available at github):

I don't claim that this is an optimized MapReduce implementation. For each item on the input queue a new channel is created. After all the channels are created it loops over the channels and blocks waiting for results in FIFO order. Responses are then placed on the reducer's queue. The interface{}s that are littered throughout the code are there because Go doesn't currently support generics. All types implement the empty interface (interface{}) so it acts like an Object in Java or a void * pointer in C. To get the real value you have to unbox it like so: x.(int). I was a little disappointed that you have to sacrifice type safety in order to be able to write a generic function.

Here is a sample program that uses MapReduce to count the number of words in all the files that can be found from the current directory:

One major idiom that I picked up from the standard library that comes with Go is the iteration idiom which creates a channel, writes data to it and then closes it. You can see this in the find_files function. This way, you can iterate over the results of the channel using range which you can see in the outer for loop in the reducer. Note that you have to populate the channel with data in a goroutine or else the application will reach deadlock since writing to a channel blocks until the data you write is consumed by default. To change this behavior you can create the channel with a buffer like this: make(chan int, 10). Writing to a buffered channel won't block until its buffer is full. Still, if you plan on using the iteration idiom it is best to spawn off a goroutine since you can populate the channel with as many items as necessary without leaving a chance of deadlock.

I also noticed that range is inconsistent. If you use it on a string or an array it will return two values, the first value being the position in the list. When used with a channel it returns only the next item. You would expect that this would be consistent across all types but for some reason it's not:

Go is a neat language. It's the first compiled language that I've used in a long time and it feels like a scripting language, which is a compliment. It's unfortunate that one has to sacrifice type safety in order to make generic functions. There isn't much that's new in Go, but the syntax and the way that things are put together works well. Languages with concurrency primitives are the future (even though some of them are 20 years old and the theory behind them is 30 years old).

Purely Functional Programming for Python

2009-08-23, by Dan Bravender <dan.bravender@gmail.com>

For the past couple of years I have been playing around with Functional Programming (FP) but I still do a lot of my "real work" in imperative OOP in Python. There are pros and cons to each approach to solving problems.

Since some people might not be familiar with FP so I will try to explain it briefly. Basically, in a purely functional language, functions are not allowed to produce side-effects. Printing, modifying global state, modifying variables that have already been assigned a value and many, many techniques that define imperative programming are expressly banned in FP. The end result of this is that functions are referentially transparent, which is to say that you will get the same result from the function every time that you pass it the same parameters. Also, the parameters themselves can never be modified. This is the exact opposite of imperative programming. If you pass a variable by reference or if you pass a function or method an object it can be modified therein. For a more thorough introduction to FP see defmacro's excellent article Functional Programming for the Rest of Us.

Imperative OOP code is easy to understand and maintain. It jives with our understanding of reality: you say Printer.print(document) and it does exactly that. It can also run faster and use less memory than FP, since you can change values in-place. When you want to make a modification in a pure function you have to copy everything and change only the piece of data that needs to be updated. That can be slow. Many FP compiler implementers have been working on optimizations to work around this so things have gotten a lot better.

Testing in OOP can sometimes be a pain since you may have to potentially set up a massive amount of state. Objects can rely on the state of other objects, etc. Since pure functions are referentially transparent testing them just requires that you pass in some parameters and check that the return value is what you expect. There are other benefits to be had from pure functions as well.

The following was basically plagiarized from Wikipedia's article about FP:

  1. If the result of a pure function isn't used it doesn't need to be computed.
  2. If a pure function is called with the same parameters the compiler can implement caching optimizations such as memoization.
  3. If there is no data dependency between two pure expressions (the result of one isn't required as an argument to the other), their order can be reversed or they can be run in parallel.
  4. If there are no side-effects, any evaluation strategy can be used.

You might've noticed that CPUs have basically stopped getting faster recently. When we buy a computer now we just get more CPUs: dual-core, quad-core and so on according to Moore's law. This trend is going to continue to produce more and more cores.

This is bad news for the standard imperative approach to parallelization, threading, which relies on shared state. In OOP, objects are widely used. Calling a method on an object most likely modifies its state. Other threads might access the same object so it they have to check if something else is modifying the object that they want to update. If it is being updated the thread requesting the update has to block. This doesn't scale well. Pure functions share nothing and therefore don't block. Because of this more and more massively parallel systems are being built with functional languages or principles: Google uses map/reduce for searching its massive database. Facebook uses Erlang for its chat system. Amazon uses Erlang for their distributed database SimpleDB.

The other day I was explaining some of the benefits of functional programming to a coworker and it dawned on me that I might be able to explain to Python 1) how to check if a function is pure and 2) how to optimize the function according to the aforementioned rules about pure functions by manipulating the function's Abstract Syntax Tree using Python's ast module.

The result of a couple night's work is "Purely Functional Programming for Python" (pfpp), a very nascent library that checks if a function is pure and then tries optimization 3 from the above list.

I tried to make functional a decorator, so it could appear above functions, but this made it hard to get the AST, since the hackish way that I am getting the function source code isn't available when the decorator is called. I hope to solve this soon, but for now, if you want to declare something a pure function you have do something like this:

def a():
    return 10
a = functional(a)

This will check to make sure that the function is pure and then perform parallelization optimizations. Currently it checks to see that no no printing is done, variables are not assigned to more than once and no methods of objects are called. If any of these are violated the program will stop at the parse step. This is really cool. It's like being able to make your own language with Python as a starting point. And Python is a great starting point IMHO. However, for this project a lot of work remains here. There are probably many ways to get around these checks and introduce impure code.

Once it gets past the purity check, function calls within the function are parallelized. Here is an example:

from time import sleep

def x():
    sleep(2) # to simulate a long process
    return 10

def y():
    sleep(2) # to simulate a long process
    return 20

def z():
    x_result = x()
    y_result = y()
    return x_result + y_result

print z()

Given the above program, before parallelization:

$ time python demo.py
30

real    0m4.094s
user    0m0.020s
sys 0m0.052s

After adding one line, the z function's AST is transformed so that x and y are run in parallel:

z = functional(z)

$ time python demo.py 
results manager invoked
30

real    0m2.169s
user    0m0.044s
sys 0m0.088s

What's really cool about this is that, much like Haskell, if Python does end up getting a reliable way of identifying pure functions we can implement most of our programs in pure functions, automatically gaining benefits from them, and still use a few impure functions to communicate with the "real" world. Since Python lacks the insane type system of Haskell it might be a lot harder to enforce this separation but it's worth a shot.

Under the hood, pfpp requires Python 2.6 for the latest and greatest ast helpers for metaprogramming and the multiprocessing library for parallelization. Please feel free to contribute via github or shoot me an email if you are interested in this project. Memoization could be the next low hanging fruit. A lot of work remains.

UPDATE

Memoization has been added. Things like computing the Fibonacci series recursively can be made a lot faster now:

def fib(x):
    if x == 0 or x == 1:
        return 1
    return fib(x - 1) + fib(x - 2)

print fib(30)

As it is written above:

$ time python demo.py 
1346269

real    0m0.943s
user    0m0.512s
sys 0m0.092s

After setting fib = functional(fib):

$ time python demo.py 
1346269

real    0m0.188s
user    0m0.060s
sys 0m0.104s

5 times faster through memoization! While performing these benchmarks I also removed a bottleneck concerning the ResultManager, which stores the results of parallelized function calls. It is basically just a wrapper around multiprocessing.pool that turns results into thunks so they can be evaluated in parallel. Anyway, a ResultsManager was being created with each function call which was a lot of overhead. It is now only created once and placed in the func_globals of the function. Should speed things up a lot.

Sandboxed Python Environments

2009-07-26, by Dan Bravender <dan.bravender@gmail.com>

After reading Tools of the Modern Python Hacker: Virtualenv, Fabric and Pip I updated my current Python project to take advantage of some of the techniques described. Some of the most important methodologies that developers can employ are isolation and automated duplication. Virtualenv makes it easy to isolate and run different programs using different Python and library versions on the same machine. When it comes to duplication, pip makes it a snap for a developer to recreate another developer's frozen environmnent. I'll just briefly explain how I've used these tools.

Pip and virtualenv, the two tools that make Python time travel possible, were written by Ian Bicking and there's more information available on his blog post about pip requirements. They make it very easy to specify the exact library versions a project requires. From inside your project directory, it's as easy as:

easy_install virtualenv pip
virtualenv . --no-site-packages
pip -E . install [libraries your project requires, such as ipython]
# For example, when sandboxing my blog, I ran `pip -E . install weblog`
source bin/activate # this puts your shell into the sandbox
# make sure your program works here 
pip -E . freeze > requirements.txt

--no-site-packages makes the environment a sandbox so it doesn't rely on any packages already installed on your system. I highly recommend always using this setting to get the full benefit of virtualenv. "-E ." tells pip to use the environment in the current working directory. You can add the requirements.txt to version control. It's also a good idea to tell your version control system to ignore bin, docs, src, include, man and lib directories that are created by virtualenv.

The generated requirements file looks like this for my blog:

Jinja2==2.1.1
markdown2==1.0.1.13
weblog==2.1
wsgiref==0.1.2

And it looks like this for dongsa.net:

CherryPy==3.1.2
Jinja2==2.1.1
coverage==3.0.1
-e git://github.com/dbravender/gp_through_unit_tests.git@7b2625470f08e033cd0c562129ce8592da08eadf#egg=gp_through_unit_tests-0.1-py2.6-7b2625470f08e033cd0c562129ce8592da08eadf
nose==0.11.1
wsgiref==0.1.2

As you can see, pip has support for downloading and installing packages from version control systems such as git.

Later, the sandbox can be recreated like so:

virtualenv . --no-site-packages
pip -E . install -r requirements.txt

There are still some rough spots when it comes to pip and virtualenv. One noticeable problem that Jesse Noller pointed out is that the interpreter path is hardcoded at the top of every file so you can't easily move these environments around. This is pretty easy to work around, just reinstall the environment or write a simple sed script (like Jesse mentions) to update the hardcoded paths.

I'm going to get busy updating my other Python projects (especially my work projects) so other developers can easily recreate a working environment and not waste any time getting cryptic errors because I forgot to mention a particular library in the README.

Announcing dongsa.net

2009-06-28, by Dan Bravender <dan.bravender@gmail.com>

I have been working on a program that, when given an infinitive, can conjugate Korean verbs. You can see it in action at dongsa.net (Thanks for the great domain name recommendation Sangwhan!). My original attempt in Erlang never panned out (but it was good functional programming practice). I restarted a couple months ago using Python (a language that I'm much more familiar with) and it's now very close to passing a database of 115812 conjugations.

According to ohloh.net it would've cost $18,542 to develop this software commercially.

Features

  1. 르 irregular stem change [모르 -> 몰라]
  2. vowel contraction [ㅏ + ㅏ -> ㅏ] (몰라 + 았 -> 몰랐)
  3. join (몰랐 + 어 -> 몰랐어)
  4. join (몰랐어 + 요 -> 몰랐어요)

Testing

Since it deals with something as murky as a natural language it has been a rough ride getting this program to handle all the verbs in Korean. That said, I am sure that the core is pretty solid because I developed it by writing the tests first (a method called Test Driven Development). I have the book "500 Korean Verbs" written by Bryan Park and I made a bunch of unit tests using the data I extracted from the spreadsheet that someone made from the book:

$ ./test
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.....................................................................
.................................................
Name                   Stmts   Exec  Cover   Missing
----------------------------------------------------
hangeul_utils             50     50   100%   
index                     33     26    78%   26-27, 46-50, 53-54
korean_conjugator        294    293    99%   298
korean_pronunciation      87     87   100%   
korean_stemmer            23     20    86%   14, 16, 18
----------------------------------------------------
TOTAL                    487    476    97%   
----------------------------------------------------------------------
Ran 532 tests in 3.113s

OK

I made heavy use of nose's ability to create tests from generator functions. In the pronunciation test I yield from a long list of pronunciation samples:

for x, y in [(u'국물',       u'궁물'),
             ...
             (u'앉아',       u'안자'),
             (u'잃어버리다', u'이러버리다'),
             (u'앉는',       u'안는'),
             (u'닮다',       u'담다'),
             (u'닮아',       u'달마'),
             (u'못하다',     u'모타다'),
             (u'학교',       u'학꾜'),
             (u'손이',       u'소니'),
             (u'산에',       u'사네'),
             (u'돈을',       u'도늘'),
             (u'문으로',     u'무느로'),
             (u'좋은',       u'조은')
             ...]:
   yield check_pronunciation, x, y

And here's an example from the conjugation tests:

def test_model_verbs():
    yield check, declarative_present_informal_low, u'기다', u'겨'
    yield check, declarative_past_informal_low, u'기다', u'겼어'
    yield check, declarative_future_informal_low, u'기다', u'길거야'
    yield check, inquisitive_present_informal_low, u'기다', u'겨'
    yield check, inquisitive_past_informal_low, u'기다', u'겼어'
    yield check, propositive_present_informal_low, u'기다', u'겨'
    yield check, declarative_present_informal_high, u'기다', u'겨요'
    yield check, declarative_past_informal_high, u'기다', u'겼어요'
    yield check, declarative_present_informal_low, u'기다', u'겨'
    yield check, declarative_past_informal_low, u'기다', u'겼어'
    yield check, declarative_future_informal_low, u'기다', u'길거야'
    yield check, inquisitive_present_informal_low, u'기다', u'겨'

On top of doing unit testing I also semi-regularly perform a complete check of a database I extracted from the spreadsheet mentioned above. The latest run was 114382 correct conjugations / 115812 total conjugations (98.77% accurate). Some of the data is incorrect. Sometimes there are verbs that are not contracted (but the conjugation algorithm currently only handles contractions). And of course, sometimes the algorithm is incorrect. I'm going to make the database test smarter and that will mark if a verb has been conjugated properly in a particular tense at least once to weed out the verbs that have not been contracted in the database. If a conjugation fails it is printed as an assert so I can put it in the tests. Here are the results from the last run. Currently it prints out the infinitive and what is expected. I should add what dongsa.net currently returns as well to make it so a human tester can more easily determine what is happening.

Lessons learned

I learned a lot about both Korean verbs and programming through this project. Here are the highlights:

기다1869
지다1144
가깝다976
적다729
남다683
가다503
내다464
만들다456
부르다177
살다163
주다163
외우다140
보다137
까맣다119
애쓰다119
바르다115
뛰다110
되다108
걷다90
하다90
오다75
잇다66
서다63
담그다32
쓰다24
눕다23
그러다15
푸르다14
켜다12
낫다4
누르다4
깨닫다2
돕다1
아니다1
이다1
푸다1
infinitive 낫다
declarative present informal low 나아    
1. ㅅ irregular (낫 -> 나 [hidden padchim])
2. join (나 + 아 -> 나아)

This is implemented by subclassing unicode and keeping a flag if the last character has a hidden padchim. When a slice is taken from the character the flag is only transferred if the slice includes the last character.

base                                 돕
base2                                도오
base3                                도우
declarative present informal low     도와
imperative present informal high     도우세요

As you can see, base3 is used in the imperative present informal high but base2 is used in the declarative present informal low conjugation. For most verbs base2 and base3 are the same.

일다     -- 일어 일러
곱다     -- 고와 곱아
파묻다   -- 파묻어 파물어
누르다   -- 눌러 누래
묻다     -- 물어 묻어
이르다   -- 일러 이르러
되묻다   -- 되물어 되묻어
썰다     -- 썰어 써려
붓다     -- 부숴 부어 부수어
들까불다 -- 들까불러 들까불어
굽다     -- 굽어 구워
걷다     -- 걷어 걸어
뒤까불다 -- 뒤까불러 뒤까불어
이다     -- 이야 여
def find_vowel_to_append(string):
    for character in reversed(string):
        if character in [u'뜨', u'쓰', u'트']:
            return u'어'
        if vowel(character) == u'ㅡ' and not padchim(character):
            continue
        elif vowel(character) in [u'ㅗ', u'ㅏ', u'ㅑ']:
            return u'아'
        else:
            return u'어'
    return u'어' 


>>> from hangeul_utils import find_vowel_to_append
>>> print find_vowel_to_append(u'크')
어
>>> print find_vowel_to_append(u'먹')
어
>>> print find_vowel_to_append(u'알')
아
>>> print find_vowel_to_append(u'아프')
아

As you can see, most of the time it is sufficient to just look at the last vowel. However, if there is a null vowel at the end, as is the case with 아프다, you have to look at the character before that to see if it has a non-neutral vowel. If it does, that's your vowel to append, so 아프 -> 아파. In the case of 크다 there is no previous, so it gets the default ㅓ. Verbs that end with the vowels ㅗ, ㅏ, and ㅑ all get ㅏ appended when conjugated.

Pronunciation

The pronunciation rules are a single pass, but they are all evaluated. The original string can be modified by several different rules. Here's an example:

먹었 [머걷]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
먹었다 [머걷 + 다] -> [머거따]
# modified by change_padchim_pronunciation(changers=(u'ᆺ', u'ᆻ', u'ᆽ', u'ᆾ', u'ᇀ', u'ᇂ'), to=u'ᆮ')
# then modified by consonant_combination_rule(u'ᆮ', u'ᄃ', None, u'ᄄ')

Stemming

Stemming was not an original goal of the project (there isn't even a way to get stems using the web interface). But, I had an epiphany while I was writing the conjugator and it was quick to implement it in the brute force way that I imagined. I guess that real Korean stemmers that have to be efficient either have a massive lookup table or a much more efficient algorithm than my stemmer employs.

The strategy used in the stem function is pretty simple. Start from the left, take a character at a time and build a list of all conjugations that come from that character plus the preceding characters in the verb. If the conjugation doesn't appear in all the conjugations for the stem, it adds another character and continues.

>>> print korean_stemmer.stem(u'안녕하세요', verbose=True)
으
안
앋
압
알
앗
안
아
안니
안느
안녕
안녇
안녑
안녈
안녓
안년
안녀
안녕흐
안녕하
Matches conjugation imperative present informal high of the verb 안녕하다

You might be wondering why there are so many strange stems that the stemmer tries (like "으"). Sometimes information is lost when a verb is conjugated (e.g. 짓다 -> 지). Other verbs go through destructive stem changes (e.g. 걷다 -> 걸어). In order to catch irregular verbs the function modifies each character that is appended with all the possible ways that stem-changing irregulars can change form. This is why the stemmer as it is implemented is so inefficient.

>>> print korean_stemmer.stem(u'지어', verbose=True)
즈
지
짇
집
질
짓
Matches conjugation propositive present informal low of the verb 짓다
>>> print korean_stemmer.stem(u'걸으세요', verbose=True)
그
걸
걷
Matches conjugation imperative present informal high of the verb 걷다

The source can be had at github. It's written in Python. Should work in anything >= 2.5 and < 3.0.

If this floats your boat you might want to check out Hanjadic, a tool that I wrote to build my Korean vocabulary by learning the Korean pronunciation of Chinese characters.

Also, please be sure to check out Matt Strum's Hangeul Assistant. It's great to see other people working on stuff like this!