alex's blog

AnnoMarket - Text Annotation in the cloud

Last week I attended the Text Analytics Meetup in London, where I saw a presentation about the beta launch of AnnoMarket.com. [Disclaimer: since AnnoMarket.com is currently beta-quality code, anything I say might be improved or totally wrong by the time of the full commercial release.]

AnnoMarket seems to solve one technical problem in the field of Natural Language Processing (NLP): annotating documents. The business problem is: “How do I automatically annotate my text so that I can see what is in it, without employing lots of human beings to read it all and tag it up manually?”

Co-Founders Wanted

I was at a startup event last night: "Co-founders Wanted! Find or Become a London Co-founder". This was a networking event organised by Bizoogo - a new name to me. All in all, I quite enjoyed it. I volunteered my services to someone who is starting a project connected to processing financial information: she wanted people skilled in Big Data and Text Analytics.

I also told another entrepreneur to contact one of my recent clients in the fashion research industry.

LinkedIn and Neo4J

I was recently asked to get some data from LinkedIn and put it into the graph database Neo4J (http://www.linkedin.com/ and http://www.neo4j.org/). I found an article on this subject by Rik Van Bruggen and followed it (http://blog.neo4j.org/2013/08/exploring-linkedin-in-neo4j.html).

Basically, he takes you through registering your application with LinkedIn, using a simple Python script to access the REST API, saving the data in simple Cypher format and viewing it in Neo4J.
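To give a feel for the "saving data in Cypher format" step, here is a minimal sketch of my own, not the script from Rik's article. It assumes the connections have already been fetched and parsed into a list of dicts; the keys "firstName" and "lastName", the Person label and the CONNECTED_TO relationship name are all assumptions for illustration, and the real API response may look different.

```python
def cypher_escape(value):
    """Escape backslashes and double quotes for a Cypher string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_cypher(me, connections):
    """Build one Cypher query that creates `me` and links each connection.

    `connections` is a list of dicts with the hypothetical keys
    "firstName" and "lastName" (an assumption, not the real API schema).
    """
    lines = ['CREATE (me:Person {name: "%s"})' % cypher_escape(me)]
    for i, person in enumerate(connections):
        name = "%s %s" % (person["firstName"], person["lastName"])
        # One node per connection, bound to a unique identifier p0, p1, ...
        lines.append('CREATE (p%d:Person {name: "%s"})' % (i, cypher_escape(name)))
        # Relationship name is illustrative only.
        lines.append("CREATE (me)-[:CONNECTED_TO]->(p%d)" % i)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data standing in for a parsed API response.
    sample = [
        {"firstName": "Ada", "lastName": "Lovelace"},
        {"firstName": "Alan", "lastName": "Turing"},
    ]
    print(to_cypher("Alex", sample))
```

The output is meant to be run as a single query, since the identifier me has to stay bound across all the CREATE clauses.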

Behemoth and Text Analysis

We were recently asked to process a lot of Word documents to extract information from them. This article talks about some of the tools we are using.

GATE

The core system we use for this sort of thing is called GATE - the General Architecture for Text Engineering (http://gate.ac.uk/). Basically, GATE is a Java-based framework for people (mostly academics!) to build text processing pipelines. It uses a number of other components through an extensive plugin mechanism. These plugins include:

* Tika for parsing different file formats (http://tika.apache.org/)

Hadoop Book review: Instant MapReduce Patterns – Hadoop Essentials How-to

Well, I said I would review this book but I have been putting it off. I just don't think I can do it justice. I did not really like this book but I can't properly say what is wrong with it.

The title of the book is the first problem:

"Instant MapReduce Patterns – Hadoop Essentials How-to"

Now when I saw "MapReduce Patterns" I was really expecting design patterns: logical recipes for how to do various tasks with MapReduce. I was expecting something like the Gang of Four book on Design Patterns. That isn't what this book is.

One-Jar helps package java apps

So I recently needed to package up a Spring Roo Java application so that it could be run by someone else remotely. Now most Spring Roo apps I know are web apps - deployed as WAR files to a web server like Tomcat. But I was asked to make sure it ran from the command line. The main problem I had was that all my third-party jar files are stored in my local Maven repository.

Hadoop User Group - UK Oct 13th

Well, I hope to make it to the Hadoop User Group UK meeting in London tomorrow - but I am suffering from a cold. I hope to see you there, but am not sure whether I will be well enough. Booo.

http://skillsmatter.com/event/nosql/data-integration

By the way - the HUGUK mailing list is moving to Meetup.com.
You still need to register with Skillsmatter for events at Skillsmatter offices...

Hadoop User Group - UK

As usual, I intend to be at the Hadoop User Group UK meeting in London next week. Here is the original announcement from Dan Harvey.

Now the summer break is drawing to a close, we've got our September
meetup coming up. It's going to be on the 8th September at
SkillsMatter again with the theme for the evening being the Hadoop
ecosystem. This has evolved quite a bit since we last met and has
sparked quite a bit of discussion.

We've got two talks arranged for the evening :-

- Dan Harvey will start with a talk on the "State of the Hadoop ecosystem"
