Patrice Riemens on Tue, 24 Mar 2009 06:37:38 -0400 (EDT)
<nettime> Ippolita Collective: The Dark Face of Google, Chapter 4 (First part)
NB this book and translation are published under Creative Commons license 2.0 (Attribution, Non Commercial, Share Alike). Commercial distribution requires the authorisation of the copyright holders: Ippolita Collective and Feltrinelli Editore, Milano (.it)

Ippolita Collective
The Dark Side of Google (continued)

Chapter 4. Algorithms or Bust! (Part 1)

Google's mind-boggling rate of growth has not at all diminished its reputation as a fast, efficient, exhaustive, and accurate search engine: haven't we all heard the phrases "if it's not on Google, it doesn't exist!" and "it's faster with Google!"? At the core of this success lies, besides the elements we have discussed before, the PageRank[TM] algorithm /we mentioned in the introduction/, which steers Google's spider's forays through the Net. Let's now look more closely at what it is and how it works.

Algorithms and real life

An algorithm [*N1] is a method for resolving a problem: a procedure built up of sequences of simple steps leading to a certain {desired} result. An algorithm that actually solves a problem is said to be accurate, and if it does so speedily, it is also efficient. There are many different types of algorithms, and they are used in the most diverse scientific domains. Yet algorithms are not some kind of arcane procedure concerning {and known} only {to} a handful of specialists; they are devices that profoundly influence our daily lives, much more so than would appear at first sight. Take for instance the technique used to tape a television programme: it is based on algorithms; so too are the methods for putting a pile of papers in order, or for sequencing the stop-overs of a long journey. Within a given time, by going through a number of simple, replicable steps, we make a more or less implicit choice of the algorithms that apply to the problem at hand. 'Simple', in this regard, means above all unequivocal, readily understandable for whoever will put the algorithm to work. Seen in this light, a kitchen recipe is an algorithm: "bring three litres of water to the boil in a pan, add salt, throw in one pound of rice, cook for twelve minutes and sieve, serve with a sauce to taste" is a step-by-step description {of a cooking process}, provided the reader is able to interpret correctly elements such as "add salt" and "serve with a sauce to taste".

Algorithms are not necessarily a method for obtaining completely detailed results. Some are intended to arrive at acceptable results {within a given period of time} [French text: 'without concern for the time factor', which doesn't sound very logical to me -TR]; others arrive at results through as few steps as possible; yet others focus on using as few resources as feasible [*N2]. It should also be stressed /before going deeper into the matter/ that nature itself is full of algorithms.

Algorithms really concern us all because they constitute concrete practices meant to achieve a given objective. In the IT domain they are used to solve recurrent problems in software programming, in designing networks, and in building hardware. For a number of years now, owing to the increasing importance of network-based models for analysing and interpreting reality, many researchers have focused their studies on the construction methods and network trajectories of the data which are the 'viva materia' of algorithms.
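To make the notions of 'accurate' and 'efficient' a little more concrete, here is a minimal sketch in Python of two procedures that solve the same problem - finding a value in a sorted list. Both are accurate; the second, by halving the search space at every step, is also efficient. The function names and the sample numbers are our own illustrative choices and come neither from Google nor from the Ippolita text.

    # Illustrative sketch of 'accurate' vs 'efficient': both functions solve
    # the same problem (finding a value in a sorted list), but the second one
    # needs far fewer steps on large inputs.

    def linear_search(items, target):
        """Accurate: checks every element, one simple step at a time."""
        for index, value in enumerate(items):
            if value == target:
                return index
        return -1  # not found

    def binary_search(items, target):
        """Accurate *and* efficient: halves the search space at every step."""
        low, high = 0, len(items) - 1
        while low <= high:
            mid = (low + high) // 2
            if items[mid] == target:
                return mid
            elif items[mid] < target:
                low = mid + 1
            else:
                high = mid - 1
        return -1  # not found

    if __name__ == "__main__":
        data = list(range(0, 1_000_000, 2))   # a sorted list of 500,000 even numbers
        print(linear_search(data, 999_998))   # ~500,000 comparisons
        print(binary_search(data, 999_998))   # ~20 comparisons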
The 'economy of search' John Batelle writes about [*N3] has become possible thanks to the {steady} improvement of the algorithms used for information retrieval, developed in order to augment the potential of data discovery and sharing with an {ever} increasing degree of efficiency, speed, accuracy, and security. The instance the general public is most familiar with is the 'peer-to-peer' {('P2P')} phenomenon: instead of setting up humongous databases for accessing videos, sound, texts, software, or any other kind of information {in digital format}, ever more optimised algorithms are being developed all the time, facilitating the creation of extremely decentralised networks through which any user can make contact with any other user in order to engage in {mutually} beneficial exchanges.

The strategy of objectivity

The tremendous increase in the quantity and quality of bandwidth, and of the memory in our computers, together with rapidly diminishing costs, has enabled us to surf the Internet longer, better, and faster. Just twenty years ago, modems, with just a few hundred bauds (the number of 'symbols' transmitted per second) of connectivity, were the preserve of an elite. Today, optic fiber criss-crosses Europe, carrying millions of bytes per second, and is a technology accessible to all. Ten years ago, a fair amount of technical knowledge was required to create digital content. Today, the ease of publishing on the World Wide Web, the omnipresence of e-mail, the improvement of all kinds of online collective writing systems (blogs, wikis, portals, mailing lists, etc.), together with the dwindling costs of registering Internet domains and addresses, have profoundly changed the nature of users: from simple consumers of information made available to them by IT specialists, they have increasingly become creators of information themselves.

The increase in the quality of connectivity goes together with an exponential increase in the quantity of data sent over the networks, which, as we have pointed out earlier, entails the introduction of steadily better performing search instruments. This pressing necessity exerts a deep attraction on social scientists, computer scientists, ergonomists, designers, specialists in communication, and a host of other experts. On the other hand, the 'informational tsunami' that hits the global networks cannot be interpreted as a mere 'networkisation' of societies as we know them, but must be seen as a complex phenomenon needing a completely fresh approach. We therefore believe that such a theoretical endeavour cannot be left to specialists alone, but demands a collective form of elaboration. If indeed the production of DIY networks constitutes an opportunity to link autonomous realms together, we must also realise that the tools of social control embedded in IT technologies represent a formidable apparatus of repression. The materialisation of this second scenario, most spectacularly exemplified by the Echelon eavesdropping system [*N5], looks {unfortunately} the most probable, given the steadily growing number of individuals who are giving information away, as opposed to an ever diminishing number of providers of search tools.
Access to the information produced by this steadily growing number of individuals is managed with an iron hand by people who retain a monopoly over it while at the same time reducing what is a tricky social issue to a mere marketing free-for-all in which the best algorithm wins. A search algorithm is a technical tool activating an extremely subtle marketing mechanism: the user trusts that the search returns are not filtered and correspond to choices made by the 'community' of surfers. In short, a mechanism of trust in the objectivity of the technology itself is triggered; the technology is recognised as 'good' because it is free from the usual idiosyncratic influences and preferences of human individuals. The 'good' machines, themselves issued from 'objective' science and 'unbiased' research, will not tell lies, since they cannot lie and in any case have no interest in doing so. Reality, however, is very much at variance with this belief, which proves to be a demagogic presumption - the cover for fabulous profits from marketing and control.

Google's case is the most blatant example of this technology-based 'strategy of objectivity'. Its 'good by definition' search engine keeps continuous track of what its users are doing in order to 'profile' their habits, and exploits this information by inserting personally targeted and contextualised ads into all their activities (surfing, e-mailing, file handling, etc.). 'Lite' ads for sure, but all-pervasive, and even able to generate feedback, so that users can, in the simplest way possible, provide information to vendors and thus improve the 'commercial suggestions' themselves by expressing choices. This continuous soliciting of users, besides flattering them into thinking that they are participants in some vast 'electronic democracy', is in fact the simplest and most cost-effective way to obtain commercially valuable information about the tastes of consumers. The users' preferences, and their ignorance {of the mechanism unleashed on them}, are what constitute and reinforce the hegemony of a search engine, since a much visited site can alter its content as a consequence of the outcome of these 'commercial suggestions': a smart economic strategy indeed.

Seen from a purely computer science point of view, search engines perform four tasks: retrieving data from the Web (spider); storing the information in appropriate archives (databases); applying the correct algorithm to order the data in accordance with the query; and, finally, presenting the results on an interface in a manner that satisfies the user. The first three tasks each require a particular type of algorithm: search & retrieval; memorisation & archiving; and query. Google's power, just like Yahoo!'s and that of the other search giants /on the network/, is therefore based on:

1. A 'spider', that is, a piece of software that captures content /on the net/;
2. An enormous capacity to store data on secure carriers, with plenty of backup facilities, to avoid any accidental loss of data;
3. An extremely fast system able to retrieve and order the returns of a query, according to the ranking of the pages;
4. An interface on the user's side to present the returns of the queries requested (Google Desktop and Google Earth, however, are programmes the user must install on her/his machine beforehand).
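Read as a pipeline, these four tasks can be sketched in a few lines of Python. What follows is a toy illustration only - the miniature three-page 'web' and the four functions are entirely invented - and in no way Google's actual code; it merely shows how crawling, archiving, ranking and the interface hang together.

    # A hypothetical, toy sketch of the four tasks a search engine performs:
    # crawl, store, rank, present. All names and the tiny in-memory 'web'
    # are invented for illustration.

    TOY_WEB = {
        "http://example.org/a": "algorithms rank pages by relevance",
        "http://example.org/b": "spiders crawl pages and follow links",
        "http://example.org/c": "databases archive pages for later retrieval",
    }

    def crawl(urls):
        """Task 1 -- spider: fetch each page's content."""
        return {url: TOY_WEB[url] for url in urls}

    def store(pages):
        """Task 2 -- archive: build a word -> urls index (a crude database)."""
        index = {}
        for url, text in pages.items():
            for word in text.split():
                index.setdefault(word, set()).add(url)
        return index

    def rank(index, query):
        """Task 3 -- query algorithm: order urls by how many query words they match."""
        scores = {}
        for word in query.split():
            for url in index.get(word, set()):
                scores[url] = scores.get(url, 0) + 1
        return sorted(scores, key=scores.get, reverse=True)

    def present(results):
        """Task 4 -- interface: show the ordered returns to the user."""
        for position, url in enumerate(results, start=1):
            print(f"{position}. {url}")

    present(rank(store(crawl(TOY_WEB)), "crawl pages"))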
Spiders, databases and searches

The spider is an application that is usually developed in the labs of the search engine companies. Its task is to surf web pages from one link to the next while collecting information such as document format, keywords, page authors, further links, etc. When done with its /exploratory/ rounds, the spider sends all this to the database for archiving /this information/. Additionally, the spider must monitor any changes on the sites visited so as to be able to programme its next visit and stock fresh data. The Google spider, for instance, manages two types of site scans: one monthly and elaborate, the so-called 'deep crawl'; the other daily, the 'fresh crawl', for updating purposes. This way, Google's databases are continuously updated /by the spider through its network surfing/. After every 'deep crawl', Google needed a few days to update the various indexes and to communicate the new results to all {its} data centers. This lag time is known as the "Google Dance": the search returns used to be variable, since they stemmed from different indexes. But Google has altered its cataloguing and updating methods from 2003 onwards, and has also spread them out much more over time, resulting in a much less pronounced 'dance': now the search results vary in a dynamic and continuous fashion, and there are no longer periodic 'shake-ups'. In fact, the search returns will even change according to users' surfing behaviour, which is archived and used to 'improve', that is to 'simplify', the identification of {the} information {requested} [*N6].

The list of choices the application works through in order to index a site is what constitutes the true force of the Google algorithm. And while the PageRank[TM] algorithm is patented by Stanford, and is therefore public, later alterations have not been /publicly/ revealed by Google, nor, {by the way}, by any other search engine company existing at the moment. Neither are the back-up and recovery methods used in the data centers being made public.

Again, from a computer science point of view, a database is merely an archive in digital format: in its simplest, and until now also its most common, form, it can be represented as one or more tables which are linked together and which have input and output values: these are called relational databases. A database, just like a classic archive, is organised according to precise rules regarding the storage, extraction and continuous enhancement of {the quality of} the data /themselves/ (think of the recovery of damaged data, redundancy avoidance, continuous updating of data acquisition procedures, etc.). IT specialists have been studying the processes of data entry, quality improvement, and search and retrieval within databases for decades now. To this end, they have experimented with various approaches and computer languages (hierarchical, network and relational models, object-oriented programming, etc.). The building of a database is a crucial component of the development of a complex information system such as Google's, as its functionality is entirely dependent on it. In order to obtain swift retrieval of data and, more generally, an efficient management of the same, it is essential to identify correctly what the exact purpose of the database is (and, in the case of relational databases, the purpose of the tables), which must be defined according to the domains and the relations that link them together. Naturally, it also becomes necessary to allow for approximations, something that is unavoidable when one switches from natural, analog languages to digital data.
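As an illustration of the crawl-and-archive loop just described, here is a minimal sketch using only Python's standard library (urllib and sqlite3). The table layout and the monthly/daily rescheduling rule are assumptions made for the example - they loosely mimic the 'deep crawl'/'fresh crawl' rhythm mentioned above - and do not describe Google's actual implementation.

    # Minimal crawl-and-archive sketch: fetch a page, store it in a relational
    # table, and schedule the next visit. Illustrative assumptions only.

    import re
    import sqlite3
    import urllib.request
    from datetime import datetime, timedelta

    db = sqlite3.connect("toy_index.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
                      url TEXT PRIMARY KEY,
                      content TEXT,
                      last_visited TEXT,
                      next_visit TEXT)""")

    def crawl(url, deep=False):
        """Fetch one page, extract its outgoing links, archive the result,
        and schedule the next visit: roughly a month ahead for a 'deep'
        crawl, the next day for a 'fresh' one."""
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        links = re.findall(r'href="(http[^"]+)"', html)   # crude link extraction
        now = datetime.now()
        next_visit = now + (timedelta(days=30) if deep else timedelta(days=1))
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
                   (url, html, now.isoformat(), next_visit.isoformat()))
        db.commit()
        return links

    # Example usage (requires network access):
    # for link in crawl("http://www.example.org/", deep=True):
    #     print("queued for a later visit:", link)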
The rub resides in the secrecy of the methods: as is the case with all proprietary development projects, as opposed to those which are open and free, it is very difficult to find out which algorithms and programmes have been used. Documents from research centers and universities allow a few glimpses of information on proprietary projects, insofar as it has been made public. They contain some useful tidbits for understanding the structure of the computers {used} and the way search engines manage data. Just to give an idea of the computing power available today, one finds descriptions of computers able to resolve Internet addresses into the unique bit sequences that serve to index them in databases in 0.5 microseconds, while executing 9,000 spiders {'crawls'} at the same time. These systems are able to memorise and analyse 50 million web pages a day [*N7].

The last algorithmic element hiding behind Google's 'simple' facade is the search system, which, starting from a query by the user, is able to find, order, rank and finally return the most pertinent results to the interface. A number of labs and universities have by now decided to make public their research in this domain, especially regarding the answers {to problems} that have been found, the various methods used to optimise access speed to the data, {questions about} the complexity of the systems, and the most interesting instances of parameter selection. Search engines must indeed be able to provide the best possible results almost instantaneously while at the same time offering the widest range of choice. Google would without doubt appear as the most advanced search engine of the moment: as we will see /in detail/ in the next chapter, these extraordinary results cannot but be the outcome of a very 'propitious' {form of} filtering... For the time being, suffice it to say that the best solution resides in a proper balance between computing power and the quality of the search algorithm. You need truly extraordinary storage supports and indexation systems to find the information you are looking for when the mass of data is measured in terabytes (1 TB = 1,000 gigabytes = 10^12 bytes), or even in petabytes (1 PB = 1,000 TB [or 1024 TB, Wikipedia's funny... -TR]), and also a remarkable ability both to determine where the information is in the gigantic archive and to calculate the {fastest} time needed to retrieve it.

And as far as Google's computing capacities are concerned, the Web is full of - not always verifiable nor credible - {myths and} legends, especially since the firm is not particularly talkative about its technological infrastructure. Certain sources are buzzing about lakhs [See Chapter 1 ;-)] of computers interconnected through thousands of gigantic 'clusters' [running appropriate GNU/Linux distros - French text unclear]; others talk about mega-computers whose design comes straight out of SciFi scenarios: humongous freeze-cooled silos where a forest of mechanical arms moves thousands of hard disks at lightning speed. Both speculations are equally plausible {or fanciful}, and do not necessarily exclude each other. In any case, it is obvious that the extraordinary flexibility of Google's machines allows for exceptional performances, as long as the system remains 'open' - to continuous {in-house} improvements, that is.
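To give a feel for the orders of magnitude mentioned above, a back-of-the-envelope calculation can be sketched in a few lines. The 50 KB average page size is our own assumption, made purely for illustration; the 50 million pages a day comes from the figures cited in the text.

    # Back-of-the-envelope sketch of the data scales discussed above.
    # The average page size is an assumption; the crawl rate is cited in the text.

    TB = 1000 ** 4          # 1 terabyte = 10^12 bytes (decimal convention)
    PB = 1000 * TB          # 1 petabyte = 1,000 TB

    avg_page_size = 50 * 1000      # assumed: ~50 KB of text per web page
    pages_per_day = 50_000_000     # crawl rate cited above

    daily_intake = pages_per_day * avg_page_size
    print(f"daily intake: {daily_intake / TB:.1f} TB")     # 2.5 TB a day
    print(f"days to fill 1 PB: {PB / daily_intake:.0f}")   # about 400 days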
(to be continued)

--------------------------
Translated by Patrice Riemens

This translation project is supported and facilitated by:
The Center for Internet and Society, Bangalore (http://cis-india.org)
The Tactical Technology Collective, Bangalore Office (http://www.tacticaltech.org)
Visthar, Dodda Gubbi post, Kothanyur-Bangalore (http://www.visthar.org)

#  distributed via <nettime>: no commercial use without permission
#  <nettime>  is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mail.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: nettime@kein.org