Google claims to know how big the web is: there are about 1 trillion (as in 1,000,000,000,000) unique URLs on the Internet.
Back in 1998 Google’s index held 26 million pages; in 2000 it passed the 1 billion mark, and in 2008 it hit 1 trillion… wait, no, that’s not how many pages they store, it’s just how many unique URLs they’ve found.
“In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.”
Google’s index is estimated to hold about 40 billion pages, and it grows every day.
They do not store every page they could. The reason: it’s too expensive, and many websites are spam, which Google does not want to keep in its index.
“We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content… that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.”
Well, the most comprehensive index of any search engine? Not so sure: Cuil claims to index more than 120 billion pages.
More recently, Google has stated that there are approximately 130 trillion pages online.
Google Web Crawlers
Different types of search engines use different methods of searching Internet resources.
Robots read only the links found on already-discovered websites and use them to build a tree-like link hierarchy. Spiders read the whole content of a website: the title, the links, the document text and the information inside meta tags. There are also engines called metacrawlers, which come in two flavours: meta search and smarter meta search engines. The first sends the search term entered by the user to several different search engines simultaneously and returns results based on what those engines find. Metacrawlers therefore have no index servers of their own and do not actively search the Internet; they reuse the existing indexes of other search engines. Smarter meta search engines use the same method with one difference: they apply linguistic and collective analysis to produce even more accurate results than the underlying engines.
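The meta search idea described above can be sketched in a few lines: fan the same query out to several engines and merge their result lists. The engine callables below are toy stand-ins for real search APIs, and the URLs are illustrative.

```python
def metasearch(query, engines):
    """Send one query to several engine functions and merge the result
    lists, de-duplicating while preserving each engine's ranking order."""
    merged, seen = [], set()
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical "engines" returning canned results for illustration.
engine_a = lambda q: ["http://a.example/1", "http://shared.example/x"]
engine_b = lambda q: ["http://shared.example/x", "http://b.example/2"]

results = metasearch("crawlers", [engine_a, engine_b])
print(results)  # the shared URL appears only once
```

A smarter meta search engine would add an analysis step before merging, rather than simply concatenating the lists.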
Search engines do not index everything that is available on the Internet, though. They do not index:
- Multimedia files – mp3, mpeg, avi, jpg, gif, png
- Documents that require a password – e.g. mailboxes or intranets, the internal networks of companies
- Sites whose authors have excluded them from indexing using a robots.txt file or noindex/nofollow meta tags
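The robots.txt exclusion in the last bullet can be checked programmatically with Python’s standard `urllib.robotparser`; the rules string and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that hides a private directory from all crawlers.
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Pages under /private/ are excluded; everything else may be fetched.
print(parser.can_fetch("Googlebot", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))           # True
```

Well-behaved crawlers, Googlebot included, perform exactly this kind of check before fetching a page.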
All the websites that exist but are excluded from search engine indexes for various reasons form the so-called “invisible network” or “invisible internet”. This invisible internet is estimated to be three times bigger than all the sites making up the visible internet. That is because companies and government institutions, for obvious reasons, do not want to share their data; they exclude their content from indexes or hide it inside intranets.
Google calls its spider “Googlebot”. It comes in two variants: Freshbot and Deepbot. The task of Freshbot is to find new, fresh content, which is why it may visit the same site even a couple of times a day. Deepbot, on the other hand, is responsible for deep website crawling: its purpose is to build a full picture of a website’s content, navigation system and all its inbound and outbound links. If you see search engine results change, it means Deepbot has been active. In Google the search process is divided into three parts.
First, Googlebot crawls the Internet looking for changes to existing websites and for new ones. It works like a standard web browser: it sends a request to a server for a specific page, then saves the page and sends it to the index servers. It can request thousands of different pages simultaneously, but it deliberately slows itself down to avoid crashing web servers or crowding out requests from real human visitors to the same server.
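That deliberate slow-down can be sketched as a rate-limited fetch loop; the stub fetcher and URLs here are illustrative, and a real crawler would track a separate delay per host.

```python
import time

def crawl(urls, fetch, delay=1.0):
    """Request pages one after another, pausing between requests so the
    crawler neither crashes a server nor crowds out real visitors.
    `fetch` is any callable returning page content (a stub below)."""
    pages = {}
    for url in urls:
        pages[url] = fetch(url)
        time.sleep(delay)  # the deliberate slow-down described above
    return pages

# Stub fetcher for illustration; a real crawler would use an HTTP client.
fake_fetch = lambda url: f"<html>content of {url}</html>"

pages = crawl(["http://example.com/a", "http://example.com/b"], fake_fetch, delay=0.1)
print(len(pages))  # 2
```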
Googlebot finds sites in two ways: through the add-URL form (Google Add URL) or by following links while crawling the Internet.
When Googlebot fetches a page, it collects all the links on that page and adds them to a “visit soon” URL list. That way it can cover a wide area of the Internet in a short time and speed up the search process. But it also causes problems: Google has to examine the “visit soon” list for duplicate URLs and delete them, to avoid visiting the same site too often.
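The “visit soon” list with duplicate elimination behaves like a simple crawl frontier, which can be sketched as follows; the URLs and the trivial normalization are illustrative assumptions.

```python
from collections import deque

class CrawlFrontier:
    """A 'visit soon' queue that silently drops URLs it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add(self, url):
        # Skip duplicates so the same page is not queued for fetching twice.
        # (Real crawlers also normalize URLs before comparing them.)
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

frontier = CrawlFrontier()
for link in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    frontier.add(link)

print(len(frontier.queue))  # 2 -- the duplicate /a was dropped
```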
To keep the indexes up to date, Google re-crawls sites on a regular basis. For a newspaper site or a heavily visited portal this can be daily; for stock quotes, much more frequently; and for other pages, once or several times a month.
Googlebot then sends the whole text it finds to the Google indexer, a set of indexing database servers. They store the text sorted alphabetically by term (a keyword or phrase). Each term is stored with a list of the documents in which it appears, and where on the page, so it is easy to locate the right document for a given user query. To eliminate unimportant words and make the search process faster, the Google indexer ignores the most frequent words in each language. These words, called “stop words” (such as is, on, or, the, at, in, how, why), don’t make the search process any more precise, so they can be ignored.
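The structure described here is an inverted index with stop-word filtering, which a minimal sketch can illustrate; the stop-word list and the tiny documents are assumptions, not Google’s actual data.

```python
STOP_WORDS = {"is", "on", "or", "the", "at", "in", "how", "why"}

def build_index(docs):
    """Map each non-stop-word term to the documents (and word positions)
    it occurs in -- a toy inverted index."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue  # stop words are ignored, as described above
            index.setdefault(word, []).append((doc_id, position))
    return index

docs = {
    1: "how the crawler works",
    2: "the crawler visits the web",
}
index = build_index(docs)

print(index["crawler"])  # occurs in both documents, with positions
print("the" in index)    # False -- filtered out as a stop word
```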
When a user enters a search term, Google sends the request to the indexing servers to find out whether the term exists in the database. At this point it is important to realize that, because of the amount of data Google holds, it would be too difficult to store all of it on one indexing server. That’s why Google uses many separate servers, each holding part of the data. The query is therefore sent to different servers simultaneously, and if the term exists in the database, Google generates the 1,000 most relevant results based on more than 100 factors (such as PageRank™, meta tags, the age of a website, its traffic and many more).
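The fan-out over partitioned index servers can be sketched as follows. The shard contents and scores are toy stand-ins, not Google’s actual partitioning or ranking.

```python
# Each "index server" holds only part of the data: term -> {doc_id: score}.
shards = [
    {"crawler": {101: 0.9}},
    {"crawler": {205: 0.4}, "index": {205: 0.7}},
]

def search(term):
    """Query every shard (in production this happens in parallel),
    merge the partial postings, and rank documents by score."""
    merged = {}
    for shard in shards:
        merged.update(shard.get(term, {}))
    return sorted(merged, key=merged.get, reverse=True)

print(search("crawler"))  # [101, 205] -- merged from both shards, ranked
```

In the real system the per-document scores would come from the many ranking factors mentioned above, rather than being stored as fixed numbers.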
At the same time, each document gets a special number attached. This Document-ID is then sent to the file servers, where a title and description of the website are added based on its meta tags. If there are no meta tags, the title and description are generated automatically from the site’s content. Here, too, many file servers work simultaneously.
The last stage is adding advertisements matching the search term, taken from the ad servers. The ad servers keep information about advertisers and campaigns, and they determine which advert should appear on the search results page. These ad servers are Google’s main source of income, bringing in 98% of the company’s overall revenue.
All this information is put together and displayed in the user’s web browser as a dynamically generated page. And all of it is done in seconds.
For more information about how Google works, take a look at these websites: