My adventures with Python, functional programming, Korean, Test Driven Development and more

When Failure is the Best Option

2011-11-22, by Dan Bravender <dan.bravender@gmail.com>

In Python (and most sane scripting languages), when something unexpected happens an exception is raised and execution stops. Damien Katz calls this the "Get the Hell out of Dodge" error handling method in his seminal "Error codes or Exceptions? Why is Reliable Software so Hard?". In his article Damien explains several different ways of handling errors. Ignoring that something went wrong is not one of them, because ignoring problems only makes them worse. But that's exactly what PHP and MySQL do for certain classes of errors.

Here's how Python handles failure:

% python
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined
>>>

PHP's default behavior is to just keep chugging along ignoring problems that could cause huge issues:

 % php 2> >(while read line; do echo -e "stderr> $line"; done)
<?
printf("%d\n", $a);
?>
0
stderr> PHP Notice:  Undefined variable: a in - on line 2
stderr> PHP Stack trace:
stderr> PHP   1. {main}() -:0

A "notice", eh? Really? Suppose you delete a record from a database, using sprintf with %d to ensure the id is an integer, and accidentally pass in an undefined variable as the id. PHP will happily tell the database to delete the record with the id of "0". In my opinion this deserves more than a "notice" in the logs. PHP's default error-handling behavior is a recipe for disaster.
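
For contrast, here is a minimal Python sketch of the same mistake. The table name records and the helper names are made up for illustration, and in real code a parameterized query would be preferable to string formatting. The only point is that %d refuses a missing value instead of quietly turning it into 0.

```python
def build_delete_query(record_id):
    # %d raises TypeError for None or a string instead of coercing it to 0.
    return "DELETE FROM records WHERE id = %d" % record_id


def try_delete(record_id):
    # Hypothetical caller: report the failure rather than send a bad query.
    try:
        return build_delete_query(record_id)
    except TypeError as exc:
        return "refused: %s" % type(exc).__name__
```

Calling try_delete(None) never produces a query at all, which is the behavior you want when the id is missing.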

Fortunately, if you must use PHP, there is a way to make PHP behave in a more sane manner and force every unexpected event to raise an exception (exception_error_handler from http://www.php.net/manual/en/class.errorexception.php):

 % php 2> >(while read line; do echo -e "stderr> $line"; done)
<?
function exception_error_handler($errno, $errstr, $errfile, $errline ) {
    throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
}
set_error_handler("exception_error_handler");

printf("%d\n", $a);
?>
stderr> PHP Fatal error:  Uncaught exception 'ErrorException' with message 'Undefined variable: a' in -:7
stderr> Stack trace:
stderr> #0 -(7): exception_error_handler(8, 'Undefined varia...', '-', 7, Array)
stderr> #1 {main}
stderr> thrown in - on line 7

There is one huge problem with this. If you are building on an existing PHP project or have a ton of PHP code it's likely that you will see frequent breaks once you make failure the default. That's a direct result of the language designers choosing such lenient default behavior. If you are starting a new project using PHP you should get your head checked (see phpsadness.com). If you pass a psychological evaluation and you still for some reason want to build a new project using PHP you should turn on immediate failure by using the error handler mentioned above and write tests to exercise your code. You'll thank me later.
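
Python has a comparable switch for its own "soft" problems. The sketch below is an analogy, not a port of the PHP handler: it uses the standard warnings module, and the risky and run functions are invented for the example. Setting the filter to "error" promotes warnings to exceptions, much as the error handler above promotes notices.

```python
import warnings


def risky():
    # Stand-in for code that notices a problem but only warns about it.
    warnings.warn("something looked wrong", UserWarning)
    return "kept going"


def run(strict):
    with warnings.catch_warnings():
        # "error" promotes every warning to an exception; "ignore" mimics
        # the lenient behaviour of carrying on regardless.
        warnings.simplefilter("error" if strict else "ignore")
        try:
            return risky()
        except UserWarning as exc:
            return "stopped: %s" % exc
```

With strict set to False the function returns "kept going"; with strict set to True the warning becomes an exception and the work stops.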

Now, let's look at default behaviors of some popular databases:

 % psql
# create table simple_table (col varchar(10));
CREATE TABLE
# insert into simple_table (col) values ('1234567890a');
ERROR:  value too long for type character varying(10)

 % mysql
mysql> create table simple_table (col varchar(10));
Query OK, 0 rows affected (0.23 sec)

mysql> insert into simple_table (col) values ('1234567890a');
Query OK, 1 row affected, 1 warning (0.08 sec)

mysql> select * from simple_table;
+------------+
| col        |
+------------+
| 1234567890 |
+------------+
1 row in set (0.00 sec)

Yup, by default MySQL just silently truncates your data. Ronald Bradford, a self-proclaimed MySQL Expert, sums it up nicely: "By default, MySQL does not enforce data integrity." That should set off alarm bells in your head if you are using or considering using MySQL. The whole point of a database is to store valid data. The simple solution is to use a database that cares about your data, like Postgres, but if you must use MySQL you should set

SQL_MODE=STRICT_ALL_TABLES

For more on why this is necessary see Ronald Bradford's Why SQL_MODE is Important blog post.
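
If you want to see the "reject instead of truncate" behavior without a database server, here is a small sketch using Python's built-in sqlite3 module. It is SQLite, not MySQL or Postgres, and the CHECK constraint is my own addition for the example, since SQLite does not enforce the length in varchar(10) by itself. The insert_value helper is hypothetical.

```python
import sqlite3


def insert_value(value):
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is an addition for this sketch: SQLite on its own
    # does not enforce the length in varchar(10).
    conn.execute(
        "create table simple_table (col varchar(10) check (length(col) <= 10))"
    )
    try:
        conn.execute("insert into simple_table (col) values (?)", (value,))
        return "stored"
    except sqlite3.IntegrityError:
        # The oversized value is refused; nothing is truncated or stored.
        return "rejected"
    finally:
        conn.close()
```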

PHP and MySQL are widely used. Maybe that is because their lenient default settings make them easy for beginners to pick up. No one really cares if there was an error saving a hit on your personal homepage to your database. The problem is that these settings are not conducive to writing quality software. When starting from scratch it's better to choose technologies that have smarter defaults like Python and PostgreSQL because the libraries and software written using these technologies will properly fail instead of doing unexpected things and filling your database with garbage.

PS

You can (and should in most cases) also force hard failure for bash scripts by running set -e at the top of the script. See David Pashley's Writing Robust Shell Scripts for more.
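
The same idea applies when you drive external commands from Python. A minimal sketch, assuming you use the standard subprocess module: passing check=True makes a non-zero exit status raise an exception instead of being silently ignored. The run_step helper and the child command are invented for the example.

```python
import subprocess
import sys


def run_step(exit_code):
    # A child Python process stands in for any external command.
    command = [sys.executable, "-c", "raise SystemExit(%d)" % exit_code]
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
        return "ok"
    except subprocess.CalledProcessError as exc:
        return "failed with exit code %d" % exc.returncode
```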

Why cherry-picking should not be part of a normal git workflow

2011-10-20, by Dan Bravender <dan.bravender@gmail.com>

Cherry-picking Workflow

Changes are made in a maintenance branch off of the release where the bug was found. The commitid from this change is then cherry-picked into the current integration branch.

% git checkout -b maintenance-branch <release tag or commitid> # (if the maintenance branch doesn't yet exist)
% git checkout -t origin/maintenance-branch # (if the branch already exists)
% git commit -am "Made a bug fix" # note the commitid
% git push origin maintenance-branch
% git checkout integration-branch # (e.g. master)
% git cherry-pick <commitid>
# resolve conflicts
% git push origin integration-branch

Problems with the cherry-picking workflow

For example:

Under the cherry-pick workflow, even though a bugfix was cherry-picked into the integration branch git cherry -v reports that the integration branch is missing this commit from the maintenance branch.

% git cherry -v maintenance-branch integration-branch
- 33de19776f4446d92b45e1fdfb2d9c37b3a867a7 Made a bug fix

Merge Workflow

Changes are made in a maintenance branch off of the release where the bug was found (same as in the cherry-picking workflow). The maintenance branch is then merged into the current integration branch.

% git checkout -b maintenance-branch <commitid> # (if the branch doesn't yet exist)
% git checkout -t origin/maintenance-branch # (if the branch already exists)
% git commit -am "Made a bug fix"
% git push origin maintenance-branch
% git checkout integration-branch # (e.g. master)
% git merge origin/maintenance-branch
# resolve conflicts
% git push origin integration-branch

Benefits

Confusingly enough, one of the most useful tools in the merge workflow to check that the state is correct is named "cherry". It shows the commits that were made in one branch and not the other. It should show no missing changes when following the merge workflow because all the changes from the maintenance branch should make their way into the integration branch:

% git cherry -v maintenance-branch integration-branch
[nothing]
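
Why does a cherry-picked fix look like a different commit? A toy model helps. The sketch below is not git's real object format; it only assumes, as git does, that a commit's id depends on its parents as well as on the change itself. All names here (commit, history, the commit messages) are illustrative.

```python
import hashlib

PARENTS = {}  # commit id -> list of parent ids


def commit(parents, patch):
    """Toy commit id: a hash of the parent ids plus the change itself."""
    payload = "|".join(parents) + "\n" + patch
    commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    PARENTS[commit_id] = list(parents)
    return commit_id


def history(commit_id):
    """Every commit reachable from commit_id, including itself."""
    seen, stack = set(), [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


release = commit([], "release 1.0")
fix = commit([release], "Made a bug fix")      # on maintenance-branch
feature = commit([release], "New feature")     # on integration-branch

# Cherry-pick: the same patch on a different parent gets a different id.
picked = commit([feature], "Made a bug fix")

# Merge: a new commit that keeps the original fix as one of its parents.
merged = commit([feature, fix], "Merge maintenance-branch")
```

In this model the original fix is reachable from the merged commit but not from the cherry-picked one, which is why tools that compare history treat the two workflows so differently.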

Draws

dongsa.net Korean Verb Conjugation Android App 2.0

2011-07-07, by Dan Bravender <dan.bravender@gmail.com>

The iPhone port of dongsa.net has had a native interface for several months now thanks to the work of Max Christian. The native interface for Android has been ready for a while but as I built it I added way too many new features. Instead of waiting until everything was fully polished I decided to strip back the new features and release an update to get the native UI out there. For those of you that were using the built-in Korean keyboard you will need to download a new input from the Market.

If you have an Android phone you can download the app directly or get it on the Android Market.

The same Javascript conjugation engine is used for both the iPhone and Android. Only the UI code has to be maintained separately. If you are curious about how this is done you can look through the source at GitHub.

One Way to Build a Federated Social Network Part 2

2011-06-03, by Dan Bravender <dan.bravender@gmail.com>

In my last post I wrote up a scheme to share structured information with friends that doesn't require a central service like Facebook or Twitter. If you didn't read that post this post will make very little sense to you. In this post I will explain how losing central control might not mean losing everything you are used to and I will revise a couple of the implementation details.

Some Questions

So, you might be thinking, if there is no central control can we find each other? Can we have a feature similar to Twitter's hashtags? Great questions. The answer is "definitely!" Currently, websites are completely federated and there are many services that allow you to quickly get publicly shared information: search engines. The protocol will make it so you mark whatever information you want as public. If you don't want to be discovered you don't have to be. Search companies can write bots that crawl the nodes to gather up public information just like they do with websites today. Private will be the default setting so you will have to explicitly mark content as public. I think this is a huge step forward from centralized services because no middle man ever has to see your private information but we can still have the benefits of the centralized services.

What about backups? What if my hard disk fails? Will I lose all the photos and status updates I've posted over the years? Another great question. Let's add a feature so every post is signed digitally by your private key. In the event that something does happen you can authenticate with your private key to your friends' nodes and request all the personal information that you have shared with them and then verify the integrity of the information that you received. You will definitely want to back up your non-public assets. We could add another feature to make it so you can export encrypted copies of your content and you can back it up however you please.

Implementation Changes

Requiring a VPS or a host that is directly accessible via the internet is probably going to limit who can use a system like this. I'm starting to think that the client should run on your machine and connect with other clients via NAT hole punching. NAT punching is used to share information among peers on P2P networks so it is perfectly suited for this project. There would need to be a service that connects clients in this scenario. Perhaps the lookup could be based on some user UUID or public key signature. This is the point where, if you are paranoid, outsiders could potentially see who is connecting with whom. There would have to be a way for people to connect directly to one another as well so you can avoid the matching services if you are paranoid. I looked at a Google project called libjingle which implements TCP on UDP for telephony. Also, Skype's protocol was partially reverse engineered very recently. Some existing library will make this functionality possible.

Git is starting to look like a pretty bad choice for synchronizing data. Since you have to synchronize all or none of the data in a repository it makes it impossible to share only some of the data with certain peers. I'm going to replace Git with a much simpler protocol that offers more flexibility. Since it knows which friend is making a particular request the system can limit what data is shared with them based on your settings. The synchronization protocol will be very similar to the protocol that CouchDB uses to show what updates have been made to a database. This is what the CouchDB update feed spits out:

{"seq":12,"id":"couchid","changes":[{"rev":"1-beef2479643c2b380f99507a7767f3d5"}]}

Similarly, in the new synchronization protocol after a client authenticates to another client with their key (all clients will run SSH servers) the requesting client would make an HTTP request for changes since their last successful synchronization. The response would be a list of all the ids that have changed or been added which are visible to the peer making the request:

f572d396fae9206628714fb2ce00f72e94f2258f
7269918432597df3ec42b62acd81643d79134cf8
...
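
Here is a rough Python sketch of what the server side of that request might compute. Everything in it is an assumption made for illustration: the Change record, the sequence numbers borrowed from the CouchDB-style feed, and the peer names. It leaves out authentication and HTTP entirely and only shows the filtering step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int                              # position in the local update log
    doc_id: str                           # id of the changed document
    public: bool = False                  # private is the default
    shared_with: frozenset = frozenset()  # peers allowed to see it


def changes_since(log, last_seq, peer):
    """Ids changed after last_seq that this peer is allowed to see."""
    return [
        change.doc_id
        for change in log
        if change.seq > last_seq and (change.public or peer in change.shared_with)
    ]


LOG = [
    Change(1, "a", public=True),
    Change(2, "b", shared_with=frozenset({"jane"})),
    Change(3, "c"),  # never marked public or shared, so nobody else sees it
]
```

A peer that last synchronized at sequence 1 would ask for changes_since(LOG, 1, "jane") and get back only the ids it is allowed to fetch.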

I don't want to make too grand a statement about the importance of having a decentralized replacement for services like Facebook and Twitter. I will say that I think email would have failed spectacularly if it had been centralized instead of federated and my guess is that it will be better for everyone except the investors and owners of the centralized social networks if we move to more secure distributed systems.

One Way to Build a Federated Social Network

2011-05-31, by Dan Bravender <dan.bravender@gmail.com>

There are companies making millions of dollars off of your personal information in exchange for giving you a way to easily share data with your friends. Facebook, Twitter and all the rest of these networks are all centralized services. You give them your data, they keep a copy and hopefully they share the data with only the people you told them to share it with. The funny thing is that for decades we have had email which is a federated service that gives us a less structured way to share data with our friends. With email we could send pictures to our friends. With Facebook we get the power of crowd-sourcing. Our friends can tag and comment on our pictures. Surely there must be a way for us to do this in a federated way without requiring that we hand our data over to a middle-man.

There have been attempts at building a Federated Social Network. Diaspora is one such attempt that drew a lot of early buzz and funding. When I saw it I thought "thank goodness someone is solving that problem". I must say that one year on it appears to me as though they are not addressing the real problem. I was thoroughly disappointed with the result of their work: a Rails-based clone of Facebook. In my opinion what is needed here is a new federated protocol that can be easily extended with new content types and that protects access to data with private keys. On top of that new clients (web, desktop, mobile, whatever) can be built.

The following is a brain dump of one way of doing this.

Every user would have their own node or share a node with a group of people that they trust on a server of their choice. A working title for this project could be "A League of Nodes" but hopefully we'll come up with something better than that.

Basic infrastructure

Very few systems are as efficient as Git is when it comes to synchronizing data so it will be employed for sending and receiving updates.

Data will be stored in UUID filenames, similar to the way that git stores its data in .git/objects, but we will store these objects in the working tree. The files will be either JSON strings or binary data. The one required JSON field will be type. Creation date and author can be extracted from the Git logs.

A NoSQL document store such as CouchDB or MongoDB would be used to store the files and the JSON documents. At this point if you are familiar with CouchDB and its awesome built-in synchronization capabilities you might be questioning my sanity about implementing a new synchronization protocol. The problem with CouchDB's synchronization is that if we want to share with another user they would automatically get all of our friends' data as well. (There might be a way around this, please leave me a comment if you know of a way.) When an update is received from another user the UUIDs in your database would be updated with the latest content. To prevent tomfoolery, each UUID would be prefixed with the unique UUID of the user who made the update so people could not clobber or update other users' existing UUIDs in your database. When an update is received it is merged into your database.
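
To make the prefix rule concrete, here is a small Python sketch. It uses a plain dict in place of the document store, and the "user:id" key layout with a colon separator is my own assumption; the post does not specify one. The merge_update function is hypothetical.

```python
def merge_update(db, sender_id, docs):
    """Merge docs from sender_id into db, refusing ids outside their prefix."""
    prefix = sender_id + ":"
    for doc_id, doc in docs.items():
        if not doc_id.startswith(prefix):
            raise ValueError("%s may not write %s" % (sender_id, doc_id))
        if "type" not in doc:
            raise ValueError("%s is missing the required type field" % doc_id)
    # Only touch the database once every document has passed the checks.
    db.update(docs)
    return db
```

An update from jane can add or change "jane:..." documents, but an attempt to overwrite "dan:..." fails before anything is written.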

A Twitter timeline or Facebook status listing is a single query:

> db.content.find({'type': 'update'}).sort({'date': -1})
{ "_id" : ObjectId("4de3d4a4475e87b4e7ce60d1"), "type" : "update", "user" : "Dan", "body" : "Dan welcomes everyone else", "date" : "Tue May 31 2011 02:32:20 GMT+0900 (KST)" }
{ "_id" : ObjectId("4de3d3f9668d1f97b29312ad"), "type" : "update", "user" : "jane", "body" : "Jane says: here I am", "date" : "Tue May 31 2011 02:29:29 GMT+0900 (KST)" }
{ "_id" : ObjectId("4de3d3db668d1f97b29312ac"), "type" : "update", "user" : "fred", "body" : "First post from Fred", "date" : "Tue May 31 2011 02:28:59 GMT+0900 (KST)" }

Your Facebook photo albums are a little more work on the client (styling and such) but not too much:

> db.content.find({'type': {'$in': ['photo', 'photo-tag', 'photo-comment']}}).sort({'date': -1})
{ "_id" : ObjectId("4de3d746475e87b4e7ce60d4"), "type" : "photo-tag", "user" : "Dan", "photo" : ObjectId("4de3d6f1475e87b4e7ce60d2"), "date" : "Tue May 31 2011 02:43:34 GMT+0900 (KST)", "x" : 20, "y" : 20, "body" : "There I am!" }
{ "_id" : ObjectId("4de3d721475e87b4e7ce60d3"), "type" : "photo-comment", "user" : "Dan", "photo" : ObjectId("4de3d6f1475e87b4e7ce60d2"), "date" : "Tue May 31 2011 02:42:57 GMT+0900 (KST)", "body" : "Nice photo if I do say so myself" }
{ "_id" : ObjectId("4de3d6f1475e87b4e7ce60d2"), "type" : "photo", "user" : "Dan", "photo" : "pointer to file in GridFS", "date" : "Tue May 31 2011 02:42:09 GMT+0900 (KST)" }

Another thing that is great about this system is that it can handle new content types that don't need to be imagined when the system is created. In the same way that web browsers handled unknown tags during their Cambrian Explosion unknown content types can either be ignored or a little blurb can be shown explaining that the client doesn't know how to handle it. Clients could even give users the option to view the raw JSON of an entry to see if there is any useful information therein.
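
A client could handle this with a simple lookup table. The sketch below is one possible shape, with made-up renderer names and blurb text: known types get a renderer, and unknown types fall back to a short notice or, if the user asks for it, the raw JSON.

```python
import json


def render_update(doc):
    return "%s: %s" % (doc["user"], doc["body"])


# Content types this (hypothetical) client knows how to display.
RENDERERS = {"update": render_update}


def render(doc, show_raw=False):
    renderer = RENDERERS.get(doc["type"])
    if renderer is not None:
        return renderer(doc)
    if show_raw:
        # Let curious users inspect entries the client does not understand.
        return json.dumps(doc, sort_keys=True)
    return "[this client cannot display '%s' entries]" % doc["type"]
```

Supporting a new content type later is then a matter of adding one entry to the table, and older clients keep working in the meantime.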

Some problems that need addressing:

This is of course an explanation of the technical implementation of a truly federated social network. The actual implementation would need to be much more user friendly and hide these technical details from the user.

See part 2.

Korean Romanization

2011-02-15, by Dan Bravender <dan.bravender@gmail.com>

The state of Korean Romanization is a total disaster. After I learned Hahn.geul (the Korean script) I got rid of all of my textbooks that used Romanization because they were more confusing than they were helpful. Still, Korean needs a better Romanization system for foreigners visiting the country. There is a way to Romanize which is much closer to the way that words are actually pronounced in Korean.

Every book you pick up that has Romanized Korean in it seems to use a different system and they are all terrible. This is because in many cases there is no direct mapping between some Korean sounds and English sounds. Another reason is that there are many languages that use the Latin alphabet and they don't all pronounce every letter or diphthong in the same way. In 2000 the Korean government came up with yet another system (Revised Romanization) which, in my opinion, didn't do enough to fix the problems in the existing systems.

Here's one example for the Romanization of wall (벽):

Revised Romanization    McCune-Reischauer    Yale
byeok (byeog)           pyŏk                 pyek

As you can see, in some systems the initial "ㅂ" is transliterated as a "b" and in some it is a "p". I'm not entirely sure that this is something that can be addressed in a Romanization system because the sound in Korean is between the "p" and "b" in English. One of the biggest problems with the older systems is the use of accents to denote different vowels. Surely there must be a way to write out the vowels so they can be read without having to look up how the accent transforms the vowel. That is one good thing about the new system: no accents.

I believe the "eo" comes from the French who gave us "Seoul". This always trips up my non-Korean-speaking friends. In my system I have taken the sound of the Korean vowels and changed them so that the sound of the vowel is unambiguous. In this case "eo" is more like "uh" and then "oo" smashed together. In my system "서울" is "Suh.ool". Periods are placed between syllables.

My wife attended "Soongsil" University. I believe it was Romanized this way because it is transliterated. If you take each component of "숭실" and turn it into a list you would get "ㅅㅜㅇㅅㅣㄹ". Transliterate that without context and you would end up with something similar to "Soongsil". However, in Korean if you have a "ㅅ" followed by certain vowels it actually becomes "sh". Another confusing bit of the existing transliteration is "ㅣ" to "i". Usually without an "e" the "i" is short like in "sit". The actual sound is usually more like "ee" in "eel". When you meet a Korean whose last name is Lee their name is actually just "ee". There is no "l" sound at all in the beginning of their name (unless they are North Korean... but you probably won't have many chances to meet North Koreans). In the system I have created "숭실" becomes "Soong.sheel" because that's how it's actually pronounced.

The system I have come up with is not a direct transliteration. It first runs the Korean string through pronunciation rules and then it is transliterated from the output of the pronunciation engine.
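
To show the two-stage idea, here is a tiny Python sketch. It is not the dongsa.net engine: it only knows the handful of sounds needed for the two examples in this post, and it implements a single pronunciation rule (ㅅ before ㅣ sounds like "sh"). The decomposition uses the standard arithmetic for precomposed Hangul syllables in Unicode; the function names and lookup tables are my own.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # " " = no final

# Deliberately tiny tables: just enough for the examples in the text.
INITIAL_LATIN = {"ㅅ": "s", "sh": "sh", "ㅇ": ""}
MEDIAL_LATIN = {"ㅓ": "uh", "ㅜ": "oo", "ㅣ": "ee"}
FINAL_LATIN = {" ": "", "ㅇ": "ng", "ㄹ": "l"}


def decompose(syllable):
    """Split one precomposed Hangul syllable into initial, medial, final."""
    index = ord(syllable) - 0xAC00
    return INITIALS[index // 588], MEDIALS[index % 588 // 28], FINALS[index % 28]


def pronounce(initial, medial, final):
    # Stage 1: pronunciation rules. Only one is sketched here.
    if initial == "ㅅ" and medial == "ㅣ":
        return "sh", medial, final
    return initial, medial, final


def romanize(word):
    # Stage 2: transliterate the output of the pronunciation stage.
    parts = []
    for syllable in word:
        initial, medial, final = pronounce(*decompose(syllable))
        parts.append(INITIAL_LATIN[initial] + MEDIAL_LATIN[medial] + FINAL_LATIN[final])
    return ".".join(parts).capitalize()
```

Skipping the pronounce step would give "Soong.seel" for "숭실", which is exactly the kind of context-free transliteration the system is meant to avoid.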

You can try it out below or on dongsa.net but I doubt that it will work in older browsers or Internet Explorer:


dongsa.net iOS App

2010-11-12, by Dan Bravender <dan.bravender@gmail.com>

Thanks to the work of Max Christian dongsa.net now has an iOS port which should work on your iPhone, iPad or iPod. You can download it from the iTunes App Store.

Thanks Max!

Here are some screenshots:

dongsa.net Android App

2010-10-13, by Dan Bravender <dan.bravender@gmail.com>

Way back in June someone contacted me and asked me if dongsa.net, an online Korean verb conjugator that I wrote, was available as an iPhone app. There wasn't an app but the email inspired me to rewrite dongsa.net in Javascript so that it could be run offline on devices that support HTML 5. This is possible since dongsa.net uses an algorithm and not a database to conjugate Korean verbs. If you are interested in Korean verb conjugation you might want to check out my earlier article where I announce dongsa.net.

The other day I picked up an Android phone here in Korea and I decided to see how hard it would be to turn the Javascript-based version of dongsa.net into an app. Turns out it's not that hard at all. The application is just an embedded web browser. Since Android phones come with a WebKit-based browser they can handle pretty much any site you throw at them. This is all the Java code I had to write:

package us.bravender.android.dongsa;

import android.app.Activity;
import android.os.Bundle;
import android.webkit.WebView;

public class Dongsa extends Activity {
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);

        WebView engine = (WebView) findViewById(R.id.web_engine);
        engine.getSettings().setJavaScriptEnabled(true);

        engine.loadUrl("file:///android_asset/html/index.html");
    }
}

Here are some screenshots:

If you have an Android phone you can download the app. The source is all over at GitHub. The android directory contains all the code necessary to build the app and the html directory contains all of the Javascript and HTML required for the offline version of dongsa.net.

Some Assembly Required

2010-04-01, by Dan Bravender <dan.bravender@gmail.com>

Tired of slow web frameworks?

Does your website need to scale? Are you sick of adding new servers to handle the load of your extremely popular website? Do you have a frozen feature set and no desire to maintain your code? If so, Some Assembly Required is the scaling solution you've been looking for.

Upgrading from a scripting language

  1. Manually convert your program from the slow dynamic language it is currently written in to an nginx assembly module.
  2. Reconfigure nginx to serve the module.
  3. Sit back and watch as your iron-level optimizations improve the speed of your site by several orders of magnitude (or less).
  4. Fight all requests for new features and bug fixing with this simple phrase: "Are you kidding me? It's written in assembly! It will take months to add that new feature or track down that bug."
  5. Relax.

Example

Here's a hello world example.

The example was based on code from George Malamidis.

Productive PHP

2010-03-29, by Dan Bravender <dan.bravender@gmail.com>

I have personally said some pretty harsh things about PHP over the years. It's hard to say nice things about a language that went so long without namespaces (they only arrived in PHP 5.3). This has led to a standard library that eats up the global namespace. On my box at home there are 1,345 functions defined globally when I run get_defined_functions(). The number of defined functions depends on which modules you have installed. That said, I have started using PHP again and I'm determined to make it a more pleasant experience this time. I have discovered several tools that make working in PHP less painful and I'd like to share them.

One of the things that really bugs me about PHP is that it doesn't tell me what led up to an error. In Python you get a traceback every time an exception is raised. The good news is that if you install the xdebug module you can get nice tracebacks of PHP errors. Problem solved.

When developing I need to be able to use a REPL (Read Eval Print Loop) so I can make sure my assumptions about how the language works are correct. PHP doesn't come with a REPL but fortunately the guys at Facebook made phpsh, which is a great tool. It says a lot that the people at Facebook needed a tool like this since they have to put up with PHP every day. After installing it you can load any PHP files and then run arbitrary PHP commands and see the output. It also pulls up documentation from a database and lets you reload external PHP files. I cannot understand how people develop in PHP without it.

Last but not least is testing. This is where some of PHP's design flaws start to show up. I am a proponent of Test Driven Development which is very hard to do in PHP. Why? Because in PHP you cannot redefine global functions. This means that any call to mysql_query is going to be a call to whoever defines mysql_query first. When refactoring code that doesn't use OOP it can be very hard to ensure the code is doing what you expect except through manual testing. I banged my head against this over the weekend and I came up with GlobalMock, a way to make it so PHP can do late binding of global calls. It does require that you slightly modify your code. Here is a simple piece of code before:

<?php
mysql_connect('localhost', 'root', '');
mysql_select_db('test');

function get_uid($username) {
    $query = sprintf("select uid from users where username='%s' limit 0,1",
                     mysql_real_escape_string($username));
    $result = mysql_query($query);
    if (!$result) {
        return false;
    }
    list($uid) = mysql_fetch_array($result);
    return $uid;
}

And here is what it looks like after applying GlobalMock so runtime binding is possible:

<?php
require_once('../global_mock.php');
$gm = GlobalMock::getInstance();

$gm->mysql_connect('localhost', 'root', '');
$gm->mysql_select_db('test');

function get_uid($username) {
    $gm = GlobalMock::getInstance();
    $query = sprintf("select uid from users where username='%s' limit 0,1",
                     $gm->mysql_real_escape_string($username));
    $result = $gm->mysql_query($query);
    if (!$result) {
        return false;
    }
    list($uid) = $gm->mysql_fetch_array($result);
    return $uid;
}

Now actual unit tests can be written:

<?php
require_once('../test/simpletest/autorun.php');
require_once('../global_mock.php');

class TestUser extends UnitTestCase {
    function testInitialization() {
        $gm = GlobalMock::getInstance();
        $gm->testing();
        $gm->add_expected('mysql_connect',
                          new GlobalMockIgnore(),
                          true);
        $gm->add_expected('mysql_select_db',
                          array('test'),
                          true);
        require_once('user_testable.php');
    }

    function testGetUid() {
        $gm = GlobalMock::getInstance();
        $gm->testing();
        $gm->add_expected('mysql_real_escape_string',
                          array('angelo_luis_martin'),
                          'angelo_luis_martin');
        $gm->add_expected('mysql_query',
                          array("select uid from users where username='angelo_luis_martin' limit 0,1"),
                          'results_of_search');
        $gm->add_expected('mysql_fetch_array',
                          array('results_of_search'),
                          array('1'));
        $this->assertEqual(get_uid('angelo_luis_martin'), '1');
    }
}

If $gm->testing() isn't called the arguments will be passed to the globally defined function. I hope this helps make PHP easier to test and therefore less painful to use. If you want to give GlobalMock a spin the source is up for grabs on GitHub.
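
For readers who think in Python, here is a sketch of the mechanism as I have described it: a proxy that forwards calls to the real global function unless testing mode is on, in which case it checks each call against a queue of expectations. This is an illustration of the idea only, not a translation of the GlobalMock source, and it forwards to Python builtins simply because they are always available.

```python
import builtins


class GlobalProxy:
    """Late-binding stand-in for calls to global functions."""

    def __init__(self):
        self._testing = False
        self._expected = []

    def testing(self):
        # Switch to test mode and start with an empty expectation queue.
        self._testing = True
        self._expected = []

    def add_expected(self, name, args, result):
        self._expected.append((name, tuple(args), result))

    def __getattr__(self, name):
        # Only reached for names that are not real attributes of the proxy.
        def call(*args):
            if not self._testing:
                # Outside tests, pass straight through to the real function.
                return getattr(builtins, name)(*args)
            if not self._expected:
                raise AssertionError("unexpected call %s%r" % (name, args))
            expected_name, expected_args, result = self._expected.pop(0)
            if (name, args) != (expected_name, expected_args):
                raise AssertionError("unexpected call %s%r" % (name, args))
            return result

        return call
```

Before testing() is called, gm.len("abc") returns 3 from the real len. After testing() and add_expected("len", ["abc"], 99), the same call returns 99 without touching the real function.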