Announcing lotsofwords.com

Lots of Words is a new experiment to mash up a Wikipedia-based lexicon with images from Flickr and whatever else I can get my hands on, with the goal of building a representative and informative source of translations for any particular word, in any particular language. I’m trying to keep things as machine-readable as possible for now, so others can build on it, too.

My friend Patrick Hall and I have been musing about it for some time, and only now has a technology stack allowed me to do this as a relatively small hack rather than months of optimization work.

It turns out that indexing something as big as Wikipedia (check out those dump file sizes!) isn’t really an “idea in the head and 500 lines of code” job, unless you use the right tools. In this case, a shiny new CouchDB instance at Amazon EC2, a bit of Ruby and Merb to add some logic and presentation magic, and jQuery as a finishing touch did the trick. This gets pretty much every Web N.0 buzzword covered, although I haven’t yet made any millions from an iPhone app.

This is a spare-time project, so it made sense for me to try out as many different bits of new technology as possible and make it into a breakable toy. This is its third implementation, and the first I’m really happy with in terms of performance and malleability. CouchDB, even with 21+ million documents loaded in about 120 GB of storage, still responds in under 200 ms on all queries I’ve tried so far. It truly is, even in its pre-1.0 days, a fantastic piece of software.

Now I find myself wanting to put a nice front-end on this. The current Flickr mash-up is already very interesting (and, it turns out, solves the problem of cross-language information retrieval for a small subset of Flickr), but I’m sure others will have much more useful ideas about what to do with this data. My colleague Robert Rees has helped put together a hackfest here in the ThoughtWorks UK office, together with the nice folks from the London JavaScript Meetup group. Come join us on 12 November!

If you just want to get to the code (be forewarned it is ugly!), it’s on GitHub.



60 years of the UDHR

It’s been nearly 60 years since the Universal Declaration of Human Rights was adopted, and Seth Brau has created an amazing typographical interpretation of its text:

While you’re watching it, please take a moment to reflect on the ways in which your human rights have been challenged by the powers that be over your lifetime, and think of ways in which you can stop these challenges and abuses from happening.

On the geeky side, it’s worth noting that the UDHR is the most translated text in the world. I’d expect no less, but it’s technically challenging in two ways. Translators the world over have to get the wording just right so it doesn’t become ripe for abuse and misinterpretation. It is also one of the benchmarks for proper character set handling in any computer system: if you think whatever you’re building is up to scratch, you must give the translations a go. Chances are you’ll end up with mojibake all over the place at some point.
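
For the curious, here is a minimal sketch of how mojibake typically comes about, assuming a system that mistakes UTF-8 bytes for ISO-8859-1. The sample string and variable names are mine, purely for illustration.

```ruby
# A UTF-8 string, as a Portuguese translation of the title might be stored.
original = "Declaração Universal dos Direitos Humanos"

# Simulate a system that wrongly assumes the bytes are ISO-8859-1
# and then converts them to UTF-8 for display.
mojibake = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Undo the damage by reversing the mistaken conversion.
repaired = mojibake.encode("ISO-8859-1").force_encoding("UTF-8")

puts mojibake # each accented letter now shows up as two odd characters
puts repaired # identical to the original again
```

The round trip only works here because nothing was lost along the way; once a system replaces bytes it doesn't understand with question marks, the original text is gone.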


Opportunity makes the thief

Applying the “agent nouns are code smells” rationale, nicely worded by my colleague Peter Gillard-Moss, controllers are code smells. The more I look at REST and the more I look at how MVC typically gets mapped to it, the more I think the C in MVC is doing the wrong thing, even when it’s as skinny as possible.

I’ve come across only a few, extremely limited cases where I need to do more in a Rails or Merb controller than simply delegate to the model and set some options about how the resource is going to be represented. Most of this code can be inferred by simply looking at the available routes and the methods on the model, which essentially makes controllers declarative statements rather than fully capable objects that encapsulate some behaviour of my system, or coordinate inputs and outputs.

In fact, in a RESTful application where simple and CRUD-like behaviour is encouraged in (or even expected from) all resources of the domain, controllers are just plain unnecessary. If you’ve tried something like CrudTemplate, resources_controller or resource_controller, that’s the sort of thing I’m talking about… only without the controller as a class.

Sinatra gets this right by doing away with the controllers and allowing you to tie blocks of code to URL matchers. Still, opportunity makes the thief, and allowing a block of code to decide what’s going to be rendered or what models get called is still going to put developers in a position where they have to fight to keep their control code skinny, while moving as much functionality as possible into a rich domain model. It’s fighting an incredibly difficult battle, since the refactoring weapons are just not powerful enough at the moment, and might never be. I’d prefer declarative statements: what, not how.

Because declarative, ladies and gentlemen, is good.
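
To make that a bit more concrete, here is a minimal sketch of what “what, not how” could look like. It assumes a made-up `ROUTES` table, a made-up in-memory `Word` model and a made-up `dispatch` function; none of these come from Rails, Merb or Sinatra. Each route only states which model method answers it, and one generic dispatcher is the only control code left.

```ruby
# Hypothetical in-memory model; stands in for a rich domain model.
class Word
  @store = {}

  class << self
    def all
      @store.values
    end

    def find(id)
      @store.fetch(id)
    end

    def create(attrs)
      id = (@store.size + 1).to_s
      @store[id] = attrs.merge("id" => id)
    end
  end
end

# Declarative routes: what each request maps to, not how to handle it.
ROUTES = {
  ["GET",  "/words"]     => { model: Word, call: :all,    status: 200 },
  ["GET",  "/words/:id"] => { model: Word, call: :find,   status: 200, args: [:id] },
  ["POST", "/words"]     => { model: Word, call: :create, status: 201, args: [:body] }
}.freeze

# Generic dispatcher: the only "control code" in the whole sketch.
def dispatch(verb, path, body = nil)
  ROUTES.each do |(route_verb, pattern), rule|
    next unless route_verb == verb

    # Turn "/words/:id" into a regular expression with one capture group.
    regex = Regexp.new("\\A" + pattern.gsub(/:\w+/, "([^/]+)") + "\\z")
    match = regex.match(path)
    next unless match

    params = { id: match[1], body: body }
    args = (rule[:args] || []).map { |name| params[name] }
    return [rule[:status], rule[:model].public_send(rule[:call], *args)]
  end
  [404, nil]
end
```

With this in place, `dispatch("POST", "/words", { "text" => "palavra" })` answers with a 201 and the stored record, and there is no block or class where a developer could be tempted to sneak in extra logic.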



About the Death of the Working Hours

I fundamentally disagree with the idea that a software development team should be constrained to working 9 to 5. Most software developers don’t even have that luxury, working overtime to compensate for all the mistakes people like Fred Brooks told us about some decades ago. In this post, I’d like to pretend more of us are treated, and behave, as if we were closer to the “Knowledge Worker” end of a scale that starts with “Factory Worker”, while fully understanding it’s a whole different world “out there”, for whatever definition of “out there” invalidates this particular rant.

It seems funny to me that we enthusiastically build highly informative and interactive environments for teams to play in (and I use the word play in the context of a project as a game), and then treat the human beings whose minds are supposed to be completely focused on delivering business value to a customer as machines that clock in, are amazing for 8 hours… and then punch out, go home and resume their personal lives.

And this is where I want to have my cake and eat it, too. The fundamental disagreement I have with the concept of working 9 to 5 (or any other 8-hour period of the day) is that creativity, enthusiasm and logical reasoning can’t be switched on and off by the magic powers of a commute (and mine these days includes walking past Camden Lock and Market, so it’s pretty close to that). And I mean it in a good way: it’s great that developers don’t just switch off when they go back home. It’s why we have so many great open source projects coming out of what seems like pure cognitive surplus.

In fact, the very existence of a cognitive surplus tells me that I actually go through all these good ideas throughout the day. It just so happens they are sprinkled all over it, as my brain happily responds to outside stimuli, which could be an information radiator telling me something’s just happened, or the taste of my favourite local ice cream from the shop down the road. More and more, I want my work to be a part of my life, not a slice of it. As Erin Brockovich once allegedly yelled:

Not personal!? That is my work, my sweat, and my time away from my kids! If that’s not personal, I don’t know what is!

If you ask your parents, or maybe grandparents, perhaps you’ll get the same answers as I did. They told me my job is my job, and I have to do it so I can put food on the table. It’s left to my imagination that very few people in my group a generation or two ago had the pleasure of working because they truly like what they do and believe what they do is both positive and meaningful to society.

I think I’ve moved up the pyramid a little bit, so I shouldn’t feel embarrassed about thinking about “work stuff” for extended periods of time when I’m about to go to bed on a Saturday morning. In fact, that’d get me labelled as someone who’s “focused” and, nowadays, even praised as a workaholic. But if I’m caught talking and thinking about something else entirely for similarly extended periods of time while in the middle of my 8-hour workday, I’m slacking off.

Why?



Forms on the Web and the Missing Stubs

I have changed my mind a lot over the years about web development. I think I have reached another one of those points of inflection, thanks to incredibly bright folks like Simon Stewart, Dan Worthington-Bodart, Jim Webber and George Malamidis. Unfortunately, it took me a lot longer than it took them to figure this out, but at least I’m writing about it :)

About a month ago, a trip to the Brazilian Consulate General to renew my passport made a few things click. We talk a lot about forms on the web, but it’s really rare that I get to fill in a form in real life. It’s a very different experience, and while it’s somewhat painful in some respects, there are lessons to be learned.

The process goes like this: you queue up at the first booth, where an attendant asks which service you require and gives you a form to fill in and a coloured piece of paper with a number on it. They call the number on that stub when it’s your turn to be seen. When called, you present the stub, the form and any necessary supporting documentation to another attendant, who gives it all a good check and tells you to go over there to pay a fee. Again, you get called by the number, pay the fee, come back, and the attendant checks the receipt. She then decides that your application should be processed, staples the stub to another receipt and tells you to come back in a few days. When you return, you present the stub, and they hand you the passport.

That tiny little piece of paper is the essential thing we’ve missed on the web. As an example, I’ll use what Rails and Merb generate in the RESTful scaffolding. In this case, you get the magic 8 CRUD actions:

  • index
  • new
  • create
  • show
  • edit
  • update
  • delete
  • destroy

I’m really interested in new, here. Digging a little deeper, you’ll see:

Looks reasonable. Let’s try it out:

This would be the equivalent of being handed a form to fill in in real life… but all I got was the form. Where’s the stub? How is the application on the other side going to know I’m talking about the same interaction?

You could argue that that’s the exact reason why the cookie is there, but the cookie doesn’t represent this particular interaction. It represents my browser’s (or other HTTP agent’s) interaction with the whole app. In real life, I couldn’t use the same stub to also fill in my tax returns, I’d have to get another one, probably of a different colour, even. I need something that the server can use to track this particular form being filled in, for reasons I’ll discuss later.

One quick and easy solution to this is to add a UUID to that form. UUIDs are unique for all practical purposes, and are pretty cheap to generate. So cheap, in fact, that there’s no reason not to slap one on the form itself:
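
A minimal sketch of the idea, assuming a made-up helper name and a made-up XML shape (neither is what Rails or Merb scaffolding actually generates). It uses `SecureRandom.uuid` from Ruby’s standard library, which produces random (version 4) UUIDs.

```ruby
require "securerandom"

# Hypothetical helper: renders a blank form that carries its own "stub".
def new_passport_application_form
  stub = SecureRandom.uuid # random (version 4) UUID
  xml = <<~XML
    <passport_application id="#{stub}">
      <name/>
      <date_of_birth/>
    </passport_application>
  XML
  [stub, xml]
end

stub, form = new_passport_application_form
puts form
```

Every call hands out a fresh stub, so two people asking for the same blank form get two distinct interactions the server can tell apart.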

This allows us to track the entire process of filling in a web version of my little passport application workflow. In HTTP-speak, that workflow would be something like:

  • GET /passport_applications/new.xml (200 OK)
  • POST /passport_applications (201 Created)
  • GET /fee_payments/new.xml?for=09711c30-40d5-012b-3f7b-001ec212da96 (200 OK)
  • POST /fee_payments (201 Created)
  • PUT /passport_applications/09711c30-40d5-012b-3f7b-001ec212da96 (202 Accepted)
  • GET /passport_applications/09711c30-40d5-012b-3f7b-001ec212da96 (200 OK)
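
The workflow above can be sketched as a tiny in-memory stand-in for the server. Everything here is an assumption made for illustration: the `PassportOffice` class, its method names, the state names, and the 409 answer for an unpaid application, which is not part of the list above. The GET for the payment form is left out to keep it short.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the server side of the workflow.
class PassportOffice
  def initialize
    @applications = {}
  end

  # GET /passport_applications/new: hand out a blank form plus its stub.
  def new_application
    [200, { "id" => SecureRandom.uuid }]
  end

  # POST /passport_applications: the filled-in form comes back with its stub.
  def create_application(form)
    @applications[form.fetch("id")] = form.merge("state" => "awaiting_payment")
    [201, form["id"]]
  end

  # POST /fee_payments: the payment names the application it is for.
  def create_payment(application_id, amount)
    application = @applications[application_id]
    return [404, nil] unless application

    application["fee_paid"] = amount
    [201, application_id]
  end

  # PUT /passport_applications/:id: accept the application for processing.
  def accept_application(id)
    application = @applications[id]
    return [404, nil] unless application
    return [409, "fee not paid"] unless application.key?("fee_paid") # assumed status

    application["state"] = "processing"
    [202, id]
  end

  # GET /passport_applications/:id: check on the application later.
  def show_application(id)
    application = @applications[id]
    application ? [200, application] : [404, nil]
  end
end
```

The point to notice is that every call after the first carries the same identifier, which is what lets the server tie the steps together without leaning on a session cookie.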

A benefit of using a UUID to identify resources is already evident here: provided they are generated randomly (version 4) rather than from a timestamp, they are very hard to guess, so potential attackers are extremely unlikely to hit arbitrary UUIDs and get anything other than a 404 Not Found. That makes them reasonable to use in URLs for privacy-sensitive documents, although they complement proper access control rather than replace it.

Another benefit is that UUIDs also work really well as artificial primary keys in relational databases. SQL Server, Oracle, MySQL, PostgreSQL and most other RDBMSs support some UUID type, or have a UUID function. This means we don’t need sequential IDs on our tables, and while UUIDs need a little extra storage, the upside is that the database doesn’t have to perform expensive synchronization on sequences. If you are not using an RDBMS and need that extra little bit of cheap scalability, non-relational stores such as Amazon SimpleDB, CouchDB, HBase and Google Bigtable also love UUIDs.

So what kinds of cool stuff can you do if you buy some more storage and collect data about every step of a user interaction, even when that interaction wasn’t successful? Imagine that every time the number on my stub got called and I talked to the attendant, she also took a photocopy of my form and documents before handing them back. What could be done with that data, given some spare cycles?

I’m sure others will have many more interesting ideas, but the one that jumps to mind immediately is being able to see how long your forms are taking to complete and exactly at which step people trip on common validation mistakes. That data can answer questions like “is it worth adding some JavaScript that checks the email format of this field on the spot?” and “after signing up and logging in, what is the first thing my users do?”, and it can answer them with a lot more detail and accuracy than you would get by trawling the HTTP server logs or adding something like Google Analytics to your pages.
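
As a rough sketch of that kind of analysis, assume each tracked step is stored as a small record keyed by the interaction’s identifier. The record layout, the made-up sample data and both function names are mine, not something any framework provides.

```ruby
# Hypothetical event log: one entry per step of each tracked interaction.
# Times are seconds since the form was handed out; all data is made up.
EVENTS = [
  { id: "a", step: "new",    at: 0,   ok: true  },
  { id: "a", step: "create", at: 95,  ok: false, error: "email" },
  { id: "a", step: "create", at: 140, ok: true  },
  { id: "b", step: "new",    at: 0,   ok: true  },
  { id: "b", step: "create", at: 60,  ok: true  }
].freeze

# Seconds between the first and the last event of each interaction.
def completion_times(events)
  events.group_by { |e| e[:id] }.transform_values do |steps|
    first, last = steps.map { |e| e[:at] }.minmax
    last - first
  end
end

# How many failed submissions each field was responsible for.
def validation_failures(events)
  events.reject { |e| e[:ok] }
        .each_with_object(Hash.new(0)) { |e, counts| counts[e[:error]] += 1 }
end

p completion_times(EVENTS)
p validation_failures(EVENTS)
```

On this sample, interaction "a" took 140 seconds against 60 for "b", and the one failure came from the email field, which is exactly the kind of hint that would justify an on-the-spot JavaScript check.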

Suppose you discovered that quite a few of your users are having trouble paying the fee: they haven’t been told how much it is, and they have no cash at hand! You could then work out a solution, from the simplest (putting up a list of fees near the entrance) to the most complete (accepting credit and debit cards and putting a cash machine next to the booth). You could even let the process happen asynchronously: users can choose to pay when they come back to get their new passport if it’s more convenient, for example. And, best of all, it’s perfectly possible to do these things while being really nice to HTTP servers, proxies, caches and other bits of the infrastructure of the web. It’s really what REST is about, building and playing nice with the web’s infrastructure… isn’t it?

