Lots of Words is a new experiment that mashes up a Wikipedia-based lexicon with images from Flickr and whatever else I can get my hands on, with the aim of building a representative and informative source of translations for any word, in any language. I’m trying to keep things as machine-readable as possible for now, so others can build on it, too.
My friend Patrick Hall and I have been musing about it for some time, and only now has the technology stack allowed me to do this as a relatively small hack rather than months of optimization work.
It turns out that indexing something as big as Wikipedia (check out those dump file sizes!) isn’t really an “idea in the head and 500 lines of code” kind of job, unless you use the right tools. In this case, a shiny new CouchDB instance on Amazon EC2, a bit of Ruby and Merb to add some logic and presentation magic, and jQuery as a finishing touch did the trick. This covers pretty much every Web N.0 buzzword, although I haven’t yet made any millions from an iPhone app.
This is a spare-time project, so it made sense for me to try out as many different bits of new technology as possible and turn it into a breakable toy. This is its third implementation, and the first I’m really happy with in terms of performance and malleability. CouchDB, even with 21+ million documents loaded in about 120 GB of storage, still responds in under 200 ms on every query I’ve tried so far. It truly is, even in its pre-1.0 days, a fantastic piece of software.
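As a rough illustration of what one of those queries could look like from Ruby, here is a minimal sketch that builds the path for a keyed CouchDB view lookup. The database name (`lotsofwords`), design document (`lexicon`), view (`by_word`) and the `[language, word]` key shape are all hypothetical, not taken from the actual code. The path layout follows the current CouchDB documentation, and pre-1.0 releases may well have used a different one. The one firm detail is that CouchDB expects the `key` parameter to be JSON-encoded.

```ruby
require "json"
require "uri"

# Build the request path for querying a CouchDB view by key.
# All names passed in below are hypothetical placeholders.
# CouchDB expects view keys as JSON, so the key is serialised
# first and then URL-encoded along with any other parameters.
def view_path(db, design_doc, view, key, limit: nil)
  params = { "key" => JSON.generate(key) }
  params["limit"] = limit.to_s if limit
  "/#{db}/_design/#{design_doc}/_view/#{view}?#{URI.encode_www_form(params)}"
end

# Example: look up the English word "water", capped at 10 rows.
puts view_path("lotsofwords", "lexicon", "by_word", ["en", "water"], limit: 10)
```

The resulting path can be handed to any HTTP client pointed at the CouchDB host. A lookup like this reads from the view's prebuilt index rather than scanning documents.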
Now I find myself wanting to put a nice front-end on this. The current Flickr mash-up is already very interesting and, it turns out, solves the problem of cross-language information retrieval for a small subset of Flickr, but I’m sure others will have much more useful ideas about what to do with this data. My colleague Robert Rees has helped put together a hackfest here in the ThoughtWorks UK office, together with the nice folks from the London JavaScript Meetup group. Come join us on 12 November!
If you just want to get to the code (be forewarned it is ugly!), it’s on GitHub.

Diego Plentz | 02-Nov-08 at 12:31 pm | Permalink
A clear description would help a lot. Btw, if you fill in just the word, the browser is redirected to an invalid address (like http://word/).
Jan | 02-Nov-08 at 1:18 pm | Permalink
Hi Carlos,
congrats on the launch!
The lotsofwords UI, while brilliantly minimalist, could do with a bit of explanation. Or a “fill fields with example” button or so?
Cheers
Jan
–
Fabio Nascimento | 02-Nov-08 at 2:47 pm | Permalink
Hi Carlos.
I am testing, but I am redirected to page 404.
why?
Some explication?
Put labels on fields, that help, really, or some readme.
However, it is an interesting project, I am studying the code
Fabio Nascimento
Jason Davies | 03-Nov-08 at 12:06 am | Permalink
This is awesome. Thanks for indexing the Welsh Wikipedia
ara.t.howard | 03-Nov-08 at 4:06 am | Permalink
i get a bad redirect too.
Nuno Job | 03-Nov-08 at 5:18 am | Permalink
Impressive stuff
Lawrence Pit | 03-Nov-08 at 5:21 am | Permalink
Awesome. Can you also give an indication how long it took to load all that data into couchdb? And was this on a small standard EC2 instance?
Carlos Villela | 03-Nov-08 at 9:57 am | Permalink
@Diego, @Fabio, @Ara – the form was a last-minute addition and really doesn’t work that well. I’ll get around to fixing it today. Thanks for the bug report!
@Lawrence – It’s a small instance, and it took approximately a week to load it up, plus another 2 or 3 weeks to generate the view index. I sped things up a bit by creating a couple of EBS volumes and merging them with RAID-0.
Jan | 05-Nov-08 at 10:54 am | Permalink
Heya,
do you want to add yourself to http://wiki.apache.org/couchdb/CouchDB_in_the_wild?
Cheers
Jan
–
Dirceu Jr. | 05-Nov-08 at 12:39 pm | Permalink
wow! wikipedia? impressive.
i’m using couchdb + merb too, but for storage products in a price comparison engine. so far from a wikipedia index. wow again.
plok | 09-Nov-08 at 5:05 pm | Permalink
News at Couch – November 2008…
Welcome to another installment of News at Couch, our review of what’s new with, on and around CouchDB.
What’s the most interesting new idea you’ve seen in the field of web development in the past year?
New database architectur…
Ol | 13-Nov-08 at 9:39 am | Permalink
Thanks for an excellent presentation of your work at the hackfest! LotsOfWords is a technology showcase and art statement rolled into one. Very exciting stuff.
Guilherme Chapiewski | 15-Nov-08 at 2:59 pm | Permalink
Freakin’ awesome
(and really fast too).
Congrats!
Bob Dionne | 28-Nov-08 at 5:32 pm | Permalink
Very nice, well done. I had to guess how to use it but it more or less works. Some things are very wrong, e.g. English to Chinese seems broken.
What’s very impressive is the performance, which I suspect it gets from CouchDB