Monday, February 8, 2010

Cloud Sourcing: Self-Organizing Information Search

“If in other sciences we should arrive at certainty without doubt and truth without error, it behooves us to place the foundation of knowledge in mathematics.”

- Roger Bacon


A few weeks ago, while trying to avoid overhearing a conversation involving infant mortality that my fellow passengers were having on the train, I zoned out and thought of the Cloud – my comfort zone. It has often struck me as somewhat peculiar that the few e-mail addresses I use are tied into so many other portals, forums, social networks, web applications and the like, but there is no way to automatically correlate the information dispersed in this fashion all over the blogosphere.

Alice and Bob were having dinner with Carol, a friend of Bob’s. Alice was particularly interested to find that Carol was a metallurgical engineer, the field in which Alice was still only a sophomore. They exchanged numbers via SMS, and Alice posted a tweet about having a fun dinner with Carol the Blacksmith. Now, years later, Alice has graduated and needs a foot in the door to land her first job. She has broken up with Bob, changed her phone, and lost Carol’s name and number in the process. She tweets almost all day long, so there is no way of going back to look up even Carol’s name, much less her contact information. All she remembers is that there was a friend of Bob’s who worked in her industry and with whom she got along really well.

Now Alice’s intuition tells her that Carol can help her get where she wants. The only problem is that Carol no longer exists in Alice’s memory. Despite the flow of information recording her name, her phone number, and the date and place they met, nothing links to that information any longer. Searching for tweets about dinner somewhere in a two-year period returns hundreds of results, and asking Bob is out of the question (even if he was the One *sniff, sob*). So here she is, in the midst of a recession, with an expensive degree she needs a job to pay for, and no hope of finding one.

Enter the Cloud

Here’s a candidate for how her thought process might be structured as a search string:

“Friend of Bob” && metallurgical engineer && gender: Female && “Dinner” && Year: 2005 || 2006

The problem is that there was no reference to Carol’s profession in either the text message (which, in this setting, passes through the MPLS cloud and is stored somewhere in the interwebs) or the Twitter post. And, as a result of their frequent socializing with Bob’s friends, there is a veritable mountain of irrelevant Twitter posts, chat logs, e-mails, SMSes, online restaurant reservations and Myspace events. The proverbial needle in the haystack just got smaller, and not only does Alice want to find it, she needs to put the thread through its eye.

Now what Alice needs is a better way to express what she’s looking for, information that is sorted coherently, or both. For the rest of this post, I’ll be talking about information organization (with detailed comments for non-geeks).

On Alice’s Twitter page:

char post = “Dinner at Bosco’s with Bob and Carol – she’s awesome!!!”

Now, instead of simply storing the above string as an entry in a database, the following processing is performed on the post:

void Main (char post)
{
    keywords.setSubject(post);
    /* Identifies the primary topics of the post using syntactical
       evaluations of the post itself.
       keywords.subject at this point returns [Bob, Carol, Dinner, Bosco’s] */

    // people = Bob, Carol, Bosco (the last mistaken for a person)
    people = keywords.subject.people;

    for each (person in people)
    {
        bool check = newContact(person);
        if (check)
        {
            lookup(person);
        }
    }

    people.mutuality();
    /* Identifies Carol(s) in Bob’s friends list; Bosco is not found and is
       assigned as a possible place; Bob is identified as a friend */

    people.buildProfile(facebook, myspace, gmail, SMS, twitter, blog, live…);
    /* Reads information on Carol that is publicly available and fills in
       whatever information it can find; a sample information structure with
       sources may be found at the end of this pseudocode */

    keywords.setPlace();    /* Bosco’s checked against possible places, matched and stored */

    keywords.setActivity(); // activity = dinner
}
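For the curious, here is a rough, runnable take on the steps above in Python. Everything in it is an assumption made for illustration: the function name `extract_keywords`, the hard-coded tables of known contacts, places and activities, and the naive capitalised-word heuristic standing in for the "syntactical evaluations" mentioned in the pseudocode.

```python
import re

# Hypothetical lookup tables standing in for Alice's address book
# and for a directory of places and activities.
KNOWN_CONTACTS = {"Bob"}
KNOWN_PLACES = {"Bosco's"}
KNOWN_ACTIVITIES = {"dinner", "lunch", "coffee"}


def extract_keywords(post):
    """Split a post into people, new contacts, places and activities (toy heuristic)."""
    # Normalise the curly apostrophe so "Bosco’s" matches "Bosco's".
    text = post.replace("\u2019", "'")
    words = re.findall(r"[A-Za-z]+(?:'s)?", text)

    activities = [w.lower() for w in words if w.lower() in KNOWN_ACTIVITIES]
    # Capitalised words that are not activities are candidate names.
    candidates = [w for w in words
                  if w[0].isupper() and w.lower() not in KNOWN_ACTIVITIES]

    # A candidate found in the places table is a place, otherwise a person.
    places = [w for w in candidates if w in KNOWN_PLACES]
    people = [w for w in candidates if w not in KNOWN_PLACES]
    new_contacts = [p for p in people if p not in KNOWN_CONTACTS]

    return {"people": people, "new_contacts": new_contacts,
            "places": places, "activities": activities}


post = "Dinner at Bosco’s with Bob and Carol – she’s awesome!!!"
result = extract_keywords(post)
print(result)
```

Unlike the pseudocode, this sketch checks the places table up front instead of first mistaking Bosco for a person; a real system would need far better name recognition than "starts with a capital letter".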


Carol’s Body Snatcher

After identifying a possible match for the Carol mentioned in the Twitter feed (for example, she happens to be in Bob’s friends list on Facebook), the next step is to discover as much information about Carol as possible. A sample routine of the web-crawling process might look like this:

  1. Find Carol on Facebook

  2. Fill in her name, gender, date of birth etc. (The spider only finds that she’s an alumna of Duke University)

  3. Find a link to her blog on the Facebook page

  4. Find a link to Carol’s LinkedIn public profile on her blog

  5. Fill in her professional information and contact information from the LinkedIn page

If more than one Carol of comparable relevance is found (say, another who also happens to be living in the same town and had dinner plans that evening according to her Twitter feed), both could be explored and stored.
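
The crawl routine above can be sketched as a breadth-first walk over linked pages. This is only a toy under stated assumptions: `MOCK_WEB` is a hand-written stand-in for public pages (no real site is queried), the page keys, field names and the example.com address are made up, and `build_profile` is a hypothetical name.

```python
from collections import deque

# Hypothetical stand-ins for public pages; the e-mail address is a placeholder.
MOCK_WEB = {
    "facebook/carol": {
        "fields": {"first_name": "Carol", "education": "Duke University"},
        "links": ["blog/carol"],
    },
    "blog/carol": {
        "fields": {},
        "links": ["linkedin/carol"],
    },
    "linkedin/carol": {
        "fields": {"name": "Carol Watson", "title": "Engineer",
                   "employer": "Northwestern Steel Corp.",
                   "email": "carol@example.com"},
        "links": [],
    },
}


def build_profile(start_page, web=MOCK_WEB):
    """Follow links breadth-first from start_page, merging fields and recording sources."""
    profile, sources = {}, {}
    seen, queue = {start_page}, deque([start_page])
    while queue:
        page = queue.popleft()
        entry = web.get(page)
        if entry is None:
            continue  # dead link: nothing to merge
        for key, value in entry["fields"].items():
            # Keep the first value found; later pages only fill gaps.
            if key not in profile:
                profile[key] = value
                sources[key] = page
        for link in entry["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return profile, sources


profile, sources = build_profile("facebook/carol")
print(profile)
print(sources)
```

With several candidate Carols, the same function would simply be run once per starting page and each resulting profile stored alongside its sources.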

The final result is that three years after their first and only meeting, Alice, having broken up with Bob and completely forgotten everything about Carol except her line of work, still has a way of touching base with her again. Let’s take another look at the search string:

“Friend of Bob” && metallurgical engineer && gender: Female && “Dinner” && Year: 2005 || 2006

This time, despite Alice never having mentioned that Carol is a metallurgical engineer, and despite that not being her designation on her LinkedIn profile, the spider discovers that the closest thing to a metallurgical engineer who also fits the bill for the rest of the description happens to be Carol Watson, Engineer at Northwestern Steel Corp. The question mark in the earlier scenario becomes the pivotal point of the search itself.
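To make the "closest thing" idea concrete, here is a minimal sketch that treats the exact criteria as hard filters and the profession as a ranking score. The records (including Dave, Erin and the years), the `RELATED` word map and all function names are hypothetical; a real system would need a much richer notion of similarity.

```python
import re

# Hypothetical enriched records, as a profile-building step might have stored them.
RECORDS = [
    {"name": "Carol Watson", "friend_of": "Bob", "gender": "female",
     "activity": "dinner", "year": 2005,
     "profession": "Engineer at Northwestern Steel Corp."},
    {"name": "Dave", "friend_of": "Bob", "gender": "male",
     "activity": "dinner", "year": 2006, "profession": "Accountant"},
    {"name": "Erin", "friend_of": "Bob", "gender": "female",
     "activity": "dinner", "year": 2006, "profession": "Teacher"},
]

# Hypothetical map from a query word to loosely related words.
RELATED = {"metallurgical": {"metallurgical", "metallurgy", "steel", "metal"}}


def tokens(text):
    """Lower-case word set of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def profession_score(query, profession):
    """Fraction of query words found, directly or via RELATED, in the profession."""
    have = tokens(profession)
    wanted = tokens(query)
    hits = sum(1 for w in wanted if RELATED.get(w, {w}) & have)
    return hits / len(wanted)


def search(records, friend_of, gender, activity, years, profession):
    """Hard-filter on the exact criteria, then rank by closeness of profession."""
    matches = [r for r in records
               if r["friend_of"] == friend_of and r["gender"] == gender
               and r["activity"] == activity and r["year"] in years]
    scored = [(profession_score(profession, r["profession"]), r["name"])
              for r in matches]
    return sorted(scored, reverse=True)


results = search(RECORDS, "Bob", "female", "dinner", {2005, 2006},
                 "metallurgical engineer")
print(results)
```

Passing the years as a set also settles how the `Year: 2005 || 2006` part of the search string is meant to be read: either year qualifies, while all the other criteria must still hold.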

Other Examples

This hypothesis can be extended easily from social networking-based people search into other domains. One example that jumps to mind is looking for a song one has only heard a bit of and does not recall the name of: the search would match the fragmented melody against candidate songs, narrowed down by comparing with the genres one mostly listens to on YouTube, Pandora and the like.
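A toy version of the song example, purely as a sketch: melodies are reduced to an up/down/repeat contour (a Parsons-style code, which is my choice here and not something the idea depends on), a fragment matches if its contour appears inside a song's contour, and matches are ordered by how much the listener plays each genre. The catalogue, the play counts and the function names are all made up.

```python
def contour(pitches):
    """Reduce a pitch sequence to Up/Down/Repeat steps between neighbouring notes."""
    steps = []
    for prev, cur in zip(pitches, pitches[1:]):
        steps.append("U" if cur > prev else "D" if cur < prev else "R")
    return "".join(steps)


# Hypothetical catalogue: title -> (genre, melody as MIDI note numbers).
CATALOGUE = {
    "Song A": ("rock", [60, 62, 64, 62, 60, 60, 67]),
    "Song B": ("jazz", [55, 57, 59, 57, 55, 55, 62]),
    "Song C": ("rock", [60, 59, 57, 59, 60, 62, 64]),
}

# Hypothetical listening history: number of plays per genre.
GENRE_PLAYS = {"jazz": 120, "rock": 15}


def find_song(fragment, catalogue=CATALOGUE, plays=GENRE_PLAYS):
    """Return titles whose contour contains the fragment's, most-played genre first."""
    wanted = contour(fragment)
    hits = [(plays.get(genre, 0), title)
            for title, (genre, melody) in catalogue.items()
            if wanted in contour(melody)]
    return [title for _, title in sorted(hits, reverse=True)]


matches = find_song([70, 72, 70, 68, 68])
print(matches)
```

The fragment is hummed in a different key from either matching song, which is why the contour, and not the absolute pitches, is compared.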

Endless Possibilities

Of course the truth is that there is simply not enough bandwidth, fast-access memory, or processing power to fully process each and every random musing of each and every random person. But I am struck at this time by the thought that when the ARPANET was formed in the late 1960s, few could have imagined what it would eventually turn into. I sometimes have difficulty even thinking of life before Google (BGE?). Given the explosive growth in networked computing resources, we have no conclusive means of estimating how the Internet may evolve in the coming decades.

I hope to do another couple of entries on this subject in the coming weeks, hopefully with more maths and less semantics. Stay tuned.
