The DNA of Search

The internet.  It's a big old place.  Full of stuff.  Files, stories, movies, music, pictures, news, reviews.  You name it, the internet has a virtual online version of it.  But how do you find what you want?  Via a search engine of course.

The search engine of choice is generally seen to be Google.  Obviously there are local variations to this, with Baidu in China for example, and more specialised engines such as ChaCha, which focuses on human analysis of results instead of purely computational searching.  However, to get the most out of the internet, what you want to view needs to be searched, indexed and categorised.

The basic idea behind a search engine is first for it to create an index of available web pages.  This index is built by automated robots, or spiders, which crawl as many existing public web pages as possible, following links and identifying the contents of the HTML pages so that searches can be performed against them.
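
To make the crawl-and-index idea concrete, here is a minimal sketch using only the Python standard library.  The start URL, page limit and inverted-index structure are assumptions for illustration, not a description of how any real engine works.

```python
# A minimal sketch of crawling pages and building an inverted index.
# The start URL and page limit below are made-up illustrative values.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from collections import defaultdict

class PageParser(HTMLParser):
    """Collects outgoing links and visible text from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl building a simple inverted index:
    word -> set of URLs containing that word."""
    index = defaultdict(set)
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load or aren't fetchable
        parser = PageParser()
        parser.feed(html)
        for word in " ".join(parser.text).lower().split():
            index[word].add(url)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return index

if __name__ == "__main__":
    index = crawl("https://example.com")
    print(len(index), "distinct words indexed")
```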

A user would then enter a list of keywords (sometimes combined with operators such as AND, OR and NOT) to help explain what they are looking for.  The search engine scans its index trying to perform a basic match.  The result set that the search engine returns is then presented to the user.
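
The boolean matching step maps quite naturally onto set operations over that index.  The tiny hand-written index below is invented purely to show how AND, OR and NOT queries might be evaluated.

```python
# A sketch of evaluating keyword queries with AND / OR / NOT against an
# inverted index.  The index contents here are made up for illustration.
index = {
    "python": {"pageA", "pageB", "pageC"},
    "search": {"pageB", "pageC", "pageD"},
    "engine": {"pageC", "pageD"},
}

def lookup(term):
    return index.get(term.lower(), set())

# "python AND search" -> pages containing both terms
print(lookup("python") & lookup("search"))   # {'pageB', 'pageC'}

# "python OR engine" -> pages containing either term
print(lookup("python") | lookup("engine"))   # {'pageA', 'pageB', 'pageC', 'pageD'}

# "search NOT engine" -> pages with 'search' but not 'engine'
print(lookup("search") - lookup("engine"))   # {'pageB'}
```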

Now this result set is the important part.  The result set could be quite small, in which case it's generally pretty easy for the person searching to quickly validate and discard any results which they deem to be inaccurate, inappropriate or just downright bad.  However, in general, the result set will be too large to process by hand.  It could easily contain several thousand hits or sites that would need to be verified or ranked based on their content.

Can you trust what you're looking for? (via morgueFile.com)


Most search engines will attempt to perform some basic ranking process.  The ranking could be based on keywords that other users have searched for, analysed programmatically over a period of time, or on values assigned to indexed pages, such as the number of links within a site and so on.  Each search engine has a proprietary way of ranking results data, which means different engines produce different results.
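
As a hedged illustration of what a very basic ranking step might look like, the sketch below combines a crude text-match score with an invented inbound-link count.  Real engines use far more signals, and the exact weighting here is made up.

```python
# A toy ranking function: text relevance plus a damped popularity signal.
# The page data, link counts and weights are invented for illustration.
pages = {
    "pageA": {"text": "python tutorial for beginners", "inbound_links": 12},
    "pageB": {"text": "python search engine in python", "inbound_links": 45},
    "pageC": {"text": "cooking recipes", "inbound_links": 3},
}

def score(page, query_terms):
    words = page["text"].lower().split()
    # text relevance: how many times the query terms appear in the page text
    relevance = sum(words.count(term) for term in query_terms)
    # popularity: square-rooted so a heavily linked page can't dominate entirely
    popularity = page["inbound_links"] ** 0.5
    return relevance * 10 + popularity

query = ["python", "search"]
ranked = sorted(pages, key=lambda name: score(pages[name], query), reverse=True)
print(ranked)  # pageB first: it matches more terms and has more inbound links
```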

Many search engines will promote the idea of net neutrality, which allows network services, responses and searches to be created unhindered and free from government, corporate or competitive interference.

But can a search engine be free from bias?  Many search engines rely on advertising to generate a revenue stream, and do those advertised links cloud the true search results?  Google will identify a paid-for link by tagging it with the word 'sponsored' to provide some clarity.

One other major form of search bias is based on previous user search history.  The idea behind this is to try and personalise the result set based on what the user has previously searched for and the websites they have subsequently clicked through to.  But this increased personalisation, whilst it may have its benefits, starts to reduce the opportunity for new and random results.  The user becomes increasingly held within their own bubble of navigation and knowledge, not knowing what they don't know.
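
The bubble effect is easy to see in a toy example.  The sketch below boosts results whose topic matches an invented click history, so a slightly less relevant page on a familiar topic overtakes a more relevant one on an unfamiliar topic; the topics, scores and weights are all assumptions for illustration.

```python
# A sketch of history-based personalisation narrowing the result set.
results = [
    {"url": "pageA", "topic": "politics", "base_score": 0.80},
    {"url": "pageB", "topic": "sport",    "base_score": 0.78},
    {"url": "pageC", "topic": "science",  "base_score": 0.75},
]

# how often this (hypothetical) user has previously clicked each topic
click_history = {"sport": 30, "politics": 2}

def personalised_score(result, history, weight=0.01):
    boost = history.get(result["topic"], 0) * weight
    return result["base_score"] + boost

for r in sorted(results, key=lambda r: personalised_score(r, click_history),
                reverse=True):
    print(r["url"], round(personalised_score(r, click_history), 2))
# pageB (sport) now outranks pageA, and pageC -- a topic the user has never
# clicked -- stays at the bottom regardless of its base relevance.
```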

The main concern with such an approach is that the end user has no real knowledge of the result ranking and parsing process, so they become unaware of other potentially valuable search results at their disposal.

It will be interesting to see, over the coming years as the internet undoubtedly becomes larger and more diverse, whether search engine theory and the underlying ranking algorithms can become sophisticated enough to produce personalised content whilst remaining open to the random and new.