gekaprog (geka7) wrote in google_groups,

How Search Engines Work

I am fascinated by search engine technology and the level of innovation put into the product. I'd like to share my thoughts and solicit yours on how those engines might be designed. Quite unoriginally, I am going to base my article on Google and a bunch of other search engines. As a disclaimer, this is not an attempt at reverse engineering. The article is written as a sequence of short paragraphs, titled Px.

P1. I'd like to start with basic product requirements. In a nutshell, a search engine finds information. In the context of the web, it responds to a user query with a set of web links.

P2. In recent years, engines have gone a step further by starting to index electronic books, so the answer you get may be a page out of a textbook or an MIT video lecture. Thus, the old professor's argument, "don't trust everything you read on the internet", is losing ground.

P3. A search begins with a database. To answer a user query, the company has to source the data and structure it. Engines continuously crawl the web and build a compilation of sites. Another process could scan online books into the database. The engines are unlikely to copy content verbatim; instead, they categorize it. That may change as processing speed and storage capacity increase. In certain cases, like song lyrics or news articles, the compilation may be exact because user queries may address particular passages such as "love was such an easy game to play".

P4. When a site starts receiving a lot of user hits, the search engine may increase its scan depth and, driven by user demand, start saving more of the site's content.

P5. A quick optimization: if multiple web sites offer identical content, for example a song lyric, the information can be saved once. The engine would then save the links to those sites separately, thus reducing space requirements.
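
To make P5 concrete, here is a minimal sketch of how identical content could be stored once while the links are kept separately. The class name ContentStore, the whitespace and case normalization, and the example URLs are my own assumptions for illustration, not a description of what any real engine does.

```python
import hashlib
from collections import defaultdict


class ContentStore:
    """Keeps each distinct text once and remembers every URL that serves it."""

    def __init__(self):
        self.bodies = {}              # content hash -> text (stored once)
        self.urls = defaultdict(set)  # content hash -> URLs serving that text

    @staticmethod
    def fingerprint(text):
        # Assumption: collapsing whitespace and case is enough to treat
        # two copies as identical.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add(self, url, text):
        key = self.fingerprint(text)
        self.bodies.setdefault(key, text)  # only the first copy is saved
        self.urls[key].add(url)            # every link is saved
        return key


store = ContentStore()
first = store.add("http://lyrics-one.example/song", "Love was such an easy game to play")
second = store.add("http://lyrics-two.example/song", "love was such  an easy game to play")
```

Both calls return the same key, so the lyric occupies one slot in the store while both links remain retrievable.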

P6. To improve accuracy, engines keep a table of how frequently site content changes. For example, a news site may publish an article every hour while another site is updated monthly. The search engine does not have to scan the latter as often.
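
One way the table in P6 might drive the crawl schedule is sketched below. The halve-or-double rule, the field names, and the one hour and thirty day limits are assumptions I picked to keep the example short; a real crawler would likely use something more refined.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    url: str
    scans: int = 0                 # how many times the site was scanned
    changes: int = 0               # how many scans found new content
    interval_hours: float = 24.0   # current wait between scans


def record_scan(stats, changed, min_hours=1.0, max_hours=720.0):
    """Update the table row after a scan and return the new scan interval."""
    stats.scans += 1
    if changed:
        stats.changes += 1
        # Content changed: come back twice as soon, but not more than hourly.
        stats.interval_hours = max(min_hours, stats.interval_hours / 2)
    else:
        # Nothing new: wait twice as long, up to roughly a month.
        stats.interval_hours = min(max_hours, stats.interval_hours * 2)
    return stats.interval_hours


news = SiteStats("http://news.example")
quiet = SiteStats("http://quiet.example")
for _ in range(3):
    record_scan(news, changed=True)
    record_scan(quiet, changed=False)
```

After three scans each, the news site is revisited every 3 hours and the quiet site every 192 hours.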

P7. There should be a continuous feedback loop between search engine results clicked by the user and the database content.

P8. This problem can be solved by a supervised machine learning algorithm. A particular input, the search query, produces a certain set of links, and users decide which ones are most relevant. That constitutes the output. The goal of machine learning is then, given an input (query), to return the right output (search results) so that the user can get to the best answer in just a few clicks.
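
As a very small illustration of P8, the sketch below treats each click as a label and ranks links by a smoothed click-through rate. This is the simplest stand-in I could think of, not a real learning-to-rank model, and the ClickRanker name and its methods are hypothetical.

```python
from collections import defaultdict


class ClickRanker:
    """Ranks links for a query by how often users clicked them when shown."""

    def __init__(self):
        self.shown = defaultdict(int)    # (query, link) -> times displayed
        self.clicked = defaultdict(int)  # (query, link) -> times clicked

    def observe(self, query, links_shown, link_clicked):
        # One training example: the input is the query, the label is the click.
        for link in links_shown:
            self.shown[(query, link)] += 1
        if link_clicked is not None:
            self.clicked[(query, link_clicked)] += 1

    def score(self, query, link):
        # Laplace smoothing, so a link that was never shown scores 0.5.
        return (self.clicked[(query, link)] + 1) / (self.shown[(query, link)] + 2)

    def rank(self, query, links):
        return sorted(links, key=lambda link: self.score(query, link), reverse=True)


ranker = ClickRanker()
links = ["a.example", "b.example", "c.example"]
for choice in ["c.example", "c.example", "b.example", "c.example"]:
    ranker.observe("daylight savings", links, choice)
ranking = ranker.rank("daylight savings", links)
```

With three clicks on c.example and one on b.example, the ranking becomes c, b, a.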

P9. In every case, the engine has to know what the user clicked. To accomplish that, the returned links point back to the search engine's site with the final destination appended to the GET query string. This lets the engine know what the user clicked and update its own statistics table.
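
The redirect trick in P9 can be sketched with Python's standard urllib.parse module. The redirect address search.example/url and the parameter names q, pos and dest are made up for this example; real engines use their own formats.

```python
from urllib.parse import urlencode, urlparse, parse_qs

REDIRECT_BASE = "https://search.example/url"  # hypothetical redirect endpoint


def tracking_link(destination, query, position):
    """Build the link shown in the results; it points back at the engine."""
    params = urlencode({"q": query, "pos": position, "dest": destination})
    return f"{REDIRECT_BASE}?{params}"


def handle_click(tracking_url, stats):
    """Record the click in the statistics table and return the real destination."""
    params = parse_qs(urlparse(tracking_url).query)
    destination = params["dest"][0]
    key = (params["q"][0], destination)
    stats[key] = stats.get(key, 0) + 1
    return destination  # the engine would now redirect the browser here


stats = {}
wiki = "http://en.wikipedia.org/wiki/Marilyn_Monroe"
link = tracking_link(wiki, "marilyn monroe", 1)
target = handle_click(link, stats)
```

The user still ends up at the destination, but the engine has counted one click for that query and link along the way.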

P10. Every search engine aims to return relevant results in the top few lines.

P11. Search results are likely customized based on the user's geographical location as indicated by IP address, input language, local time, the currency of the region, etc. Basically, the engine is profiling the user and attempting to improve its responses based on the rest of the population and events going on in that region. For example, if one is located in the United States and starts typing "presidential", the search box may auto-complete "presidential elections" and provide the top three queries chosen by other users in that location in the last 24 hours. This understanding of the search algorithm has been used by various companies who claim to be able to raise your site's rating in Google's cache.
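
The "top three queries in that location in the last 24 hours" idea from P11 could look roughly like the sketch below. The log layout (timestamp, region, query) and the function name are assumptions of mine; mapping an IP address to a region is left out entirely.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_regional_queries(log, region, prefix, now, limit=3, window=timedelta(hours=24)):
    """Most frequent recent queries from one region that start with the typed prefix."""
    cutoff = now - window
    typed = prefix.lower()
    counts = Counter(
        query.lower()
        for when, where, query in log
        if where == region and when >= cutoff and query.lower().startswith(typed)
    )
    return [query for query, _ in counts.most_common(limit)]


now = datetime(2012, 11, 5, 12, 0)
log = [
    (now - timedelta(hours=1), "US", "presidential elections"),
    (now - timedelta(hours=3), "US", "Presidential elections"),
    (now - timedelta(hours=2), "US", "presidential debate"),
    (now - timedelta(hours=30), "US", "presidential pardon"),  # too old
    (now - timedelta(hours=1), "FR", "presidential palace"),   # other region
]
suggestions = top_regional_queries(log, "US", "presidential", now)
```

Only the two recent queries from the matching region survive, with the more popular one first.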

P12. The engine is constantly tuning its cache of search results based on user input, as mentioned in P7. It also evaluates what I'd like to call a success rate factor. If a user clicks through more than 5 to 10 links after a query, it suggests that the results do not answer the question. If such a trend continues across multiple users, the search engine has to work harder to find an answer for the query.
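
Here is a rough sketch of how the success rate factor from P12 might be computed. The threshold of 5 clicks comes from the paragraph above; the minimum of three users and the session layout (user, query, clicks) are assumptions I added so that one frustrated user does not flag a query alone.

```python
from collections import defaultdict
from statistics import mean


def struggling_queries(sessions, click_threshold=5, min_users=3):
    """Return queries where enough users needed many clicks to find an answer."""
    clicks_by_query = defaultdict(dict)
    for user, query, clicks in sessions:
        clicks_by_query[query][user] = clicks  # keep the latest session per user
    flagged = []
    for query, per_user in clicks_by_query.items():
        enough_users = len(per_user) >= min_users
        if enough_users and mean(per_user.values()) >= click_threshold:
            flagged.append(query)
    return flagged


sessions = [
    ("u1", "obscure math proof", 7),
    ("u2", "obscure math proof", 6),
    ("u3", "obscure math proof", 9),
    ("u1", "marilyn monroe", 1),
    ("u2", "marilyn monroe", 1),
    ("u3", "marilyn monroe", 2),
]
flagged = struggling_queries(sessions)
```

Only the first query is flagged, which is the signal that the engine should work harder on it.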

P13. For certain queries, answers are static. For example, a query for Marilyn Monroe may automatically produce a link to Wikipedia. The result is static and hardly modifiable unless overwhelming evidence tells the search engine that people have stopped following Wikipedia.

P14. In some cases, the search engine plays a nasty psychological trick on people by steering them to what it presents as the right answer. Most people are not experts in the field, and when they look for something, they trust the search engine to have done its homework because, after all, these products are backed by powerful companies and lots of smart people. The best answer to your question may actually be a modest math article published by some professor in Hungary, but that link's rating is low. I like to think of it as an automated librarian who consistently points people to the wrong textbook with utmost authority.

P15. A web site's speed or accessibility may play a major role in its probability of showing up. Google is not going to provide a link to a dead or overloaded site. Thus, if you would like to show up in search engine results, get a better hosting provider.

P16. Because every search engine is coming up with ways of indexing the web and trying to stay on top of changes, it may actually start offering free hosting to the most frequented sites. Why? The majority of internet traffic may actually be Google, Bing, etc. indexing the web. It would make sense to host sites in, say, Google's data center and get real-time, internal network access to content changes.

P17. Given the power of search engines, they could come up with different ways of building web sites. The goal would be to simplify scanning and indexing the World Wide Web.

P18. When you type a query, the engine tries to guess your question. It acts as both auto-complete and auto-correct. Based on its database, it lights up a few query suggestions that have been asked most recently. Suggestions are based on the premise that a lot of users are asking the same question in slightly different terms. User John has just asked "Daylight savings in New York", while David is starting to type "Time daylight...". The engine may recognize the similarities and display the top three related "best" questions. With a high degree of probability, both users are inquiring about the daylight saving time change.
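
The matching in P18 could be approximated by comparing the words typed so far against recent queries. The sketch below uses simple word-prefix overlap, which is my own simplification; it does no spelling correction, and the function name suggest is hypothetical.

```python
from collections import Counter


def suggest(partial, recent_queries, limit=3):
    """Return up to `limit` recent queries that share words with the partial input."""
    typed = partial.lower().split()
    counts = Counter(query.lower() for query in recent_queries)
    scored = []
    for query, freq in counts.items():
        words = query.split()
        # A typed word matches if some word of the query starts with it,
        # so a half-typed word such as "dayl" still counts.
        matches = sum(any(word.startswith(t) for word in words) for t in typed)
        if matches:
            scored.append((matches, freq, query))
    # Best overlap first, then most frequently asked, then alphabetical.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [query for _, _, query in scored[:limit]]


recent = [
    "Daylight savings in New York",
    "daylight savings in new york",
    "daylight saving time change",
    "New York weather",
    "time zones",
]
top_three = suggest("Time dayl", recent)
```

David's half-typed query matches John's question along with two other related ones, while the unrelated weather query is left out.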

P19. Users who are logged into Google, Yahoo or Microsoft get a more personalized experience. The engine remembers their requests and the links they followed and may rank results differently than it does for the general population.

P20. I am fairly confident certain users are rated higher than others based on their academic status or position of power. This would organically and humanly improve search results without keeping a team of search experts on payroll in every field.

P21. Similar to P20, certain sites like and are rated higher than others. Links to these sites would routinely come up on the top.

P22. It is interesting how search engines would further improve their result accuracy for those sites that are not searched for. For example, if I know that my answers can be found on, I will go there directly. How would the engine get that information unless it buys traffic logs from the site provider and cross-references them against its own IP-based database? This starts to look a lot like NSA tactics of spying on people.

P23. In many cases, large search engine providers are significantly better at searching external sites than those sites can ever be. As an example, one is more likely to find a Stack Overflow answer by searching in Google than by using Stack Overflow's own internal search. It probably means sites no longer need to create their own search.

P24. In light of P23, when sites want to be discovered and properly indexed, they should proactively reach out to Google/Yahoo/Bing and submit their latest content. This model creates a paradigm shift where large search engines no longer have to seek and scan but rather sit and wait for site owners to provide their content.