Rishab Aiyer Ghosh on Mon, 8 May 2000 16:52:53 +0200 (CEST) |
<nettime> OFSS01: First Orbiten Free Software Survey |
OFSS01: The Orbiten Free Software Survey, 1st edition, May 2000

Copyright (C) 2000 Orbiten Research - http://orbiten.org
May be distributed freely without modification.
Press contact, or for more information: ofss@orbiten.org

HEADLINE: Over 12,000 authors, 25 million lines of code analysed

Inside:
FINDINGS
DATA
SCOPE AND METHOD
CONTEXT: ORBITEN
REFERENCES

The Free Software (or Open Source) "Community" is much talked about, though little hard data on this community and its activities is available. Here, for the first time, Orbiten Research (see CONTEXT) provides a body of empirical data and analysis to explain what this community actually is.

Simple facts, such as the number of developers contributing to free software projects, the number of such projects and their size, have until now been unknown. The Orbiten Free Software Survey discovers these facts, and aims with them to provide a foundation for empirical research on the free software community.

Building on the release of CODD[1] over a year ago, the Survey will measure and track over time several aspects of the free software economy, including: the concentration (or diversity) of contributions and contributors; the degree of intersection between projects and sharing of code; the participation of developers in different projects; and the volatility of changes to the code base and the developer base. There will also be some basic statistics and data gained during the survey process, such as the total size of free software available, the amount of free software being released and/or modified each month, and a compendium of developers.

Hopefully the survey will be regular, prompt and gradually more comprehensive, providing an important source of information for academic researchers, free software users and developers alike.

Rishab Aiyer Ghosh & Vipul Ved Prakash: May 7, 2000

FINDINGS

The primary findings of OFSS01 were basic: the number of developers authoring projects included in the survey (12706), the size of the free software code base (1.04 Gigabytes, or roughly 25 million lines), and the number of identifiable free software projects (3149). Given the total lack of data on the free software economy, rough indicators of its size (limited by the initial scope of the survey) are, we believe, a good start.

Secondary findings relate to the degree of contribution to the code base by individual authors, defined for the purposes of this survey as the smallest identifiable grouping claiming credit for development of a software project. Unsurprisingly, the Free Software Foundation came out far ahead of anyone else, credited with 11% (124 Mb) of the entire surveyed code base and involved in 17% (546) of all identifiable projects. However, as with some other well-known (and highly ranked in the survey) Unix authors, such as Sun Microsystems and the Regents of the University of California, the FSF's position in our charts stems largely from the lack of credit given to individual programmers. A list of the top few contributors sorted by code and involvement in projects is given below (see DATA).

Further findings relate to the distribution of authors among projects, and code base contribution. The top 1271 authors, 10% of the total, accounted for 72.3% of the total code base. The top 10 authors alone (0.08% of the total) are credited with 19.8% of the code base. Free software development may be distributed, but it is most certainly very top heavy.

What goes for lines of code written goes for involvement in projects too. Only the top 25 authors (0.19% of the total) were credited with participation in more than 25 projects. The top 250 authors were credited with participation in over 5 projects, and the vast majority (over 77%) of authors were only involved in a single project.
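
As a rough cross-check, the concentration figures above can be recomputed from the counts and percentages published under DATA (Tables 2 and 4). The sketch below is illustrative only: the variable and function names are hypothetical and are not part of the survey's tooling, and the input values are simply copied from the tables.

```python
# Illustrative cross-check of the concentration figures.
# All names here are hypothetical; the numbers are copied from the DATA section.

TOTAL_AUTHORS = 12706   # number of identifiable authors
UNCREDITED_PCT = 8.37   # % of code base uncredited

# Table 2: share of the code base credited to each decile of authors (%)
DECILE_PCT = [72.320, 8.928, 4.062, 2.384, 1.515,
              1.008, 0.672, 0.440, 0.239, 0.060]

# Table 4: number of authors, keyed by the number of projects they are credited in
PARTICIPATION = {
    "> 25": 25,
    "6 - 24": 211,
    "3 - 5": 928,
    "Only 2": 1924,
    "Only 1": 9617,
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def credited_total(deciles):
    """Sum the code-base share credited across all author deciles."""
    return sum(deciles)


def authors_with_more_than_five(participation):
    """Count authors in the two Table 4 bins that lie above 5 projects."""
    return participation["> 25"] + participation["6 - 24"]


if __name__ == "__main__":
    credited = credited_total(DECILE_PCT)
    print(f"credited share of code base:   {credited:.3f}%")
    print(f"credited plus uncredited:      {credited + UNCREDITED_PCT:.3f}%")
    print(f"top 10 as share of authors:    {pct(10, TOTAL_AUTHORS):.2f}%")
    print(f"authors in over 5 projects:    {authors_with_more_than_five(PARTICIPATION)}")
    print(f"single-project authors:        {pct(PARTICIPATION['Only 1'], TOTAL_AUTHORS):.1f}%")
```

Computed this way, the decile shares and the uncredited share together account for very nearly 100% of the code base. The single-project share derived from Table 4 comes to about 75.7%, slightly under the "over 77%" quoted above; the tables do not say how the quoted figure was derived.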

Our conclusion: Free software development is less a bazaar of several developers involved in several projects, more a collation of projects developed single-mindedly by a large number of authors.

DATA

Number of identifiable authors:     12706
Uncredited/unidentifiable authors:  790
% of code base uncredited:          8.37%
Size of code base:                  1116500467 Bytes or 1067 Mb
Number of identifiable projects:    3149

Table 1: Top 10 authors ranked by contribution of code

Author                                         % of total
free software foundation, inc                      11.231
sun microsystems, inc                               1.848
the regents of the university of california         1.359
gordon matzigkeit                                   1.216
paul houle                                          1.042
thomas g. lane                                      0.782
the massachusetts institute of technology           0.762
ulrich drepper                                      0.559
lyle johnson                                        0.528
peter miller                                        0.525

Table 2: Author contribution by decile

Authors                  % of total
top 10 authors               19.854
top decile (1271)            72.320
2nd decile                    8.928
3rd decile                    4.062
4th decile                    2.384
5th decile                    1.515
6th decile                    1.008
7th decile                    0.672
8th decile                    0.440
9th decile                    0.239
10th decile                   0.060

Table 3: Top 10 authors ranked by participation in projects

Author                                         Projects
free software foundation, inc                       546
gordon matzigkeit                                   267
the regents of the university of california         156
ulrich drepper                                      142
roland mcgrath                                       99
sun microsystems, inc                                66
rsa data security, inc                               59
martijn pieterse                                     50
eric young                                           48
login-vern                                           47

Table 4: Author participation in projects

Projects     Authors
> 25              25
6 - 24           211
3 - 5            928
Only 2          1924
Only 1          9617

Note: 211 authors participated in 6 to 24 projects, etc.

Further data, graphics and complete tables are available at orbiten.org

SCOPE AND METHOD

The first Orbiten Free Software Survey has been prepared based on over 18 months of work in identifying, tracking and modeling interaction in the free software economy. Clearly this was not enough time, and the scope and methodology of the first survey are far from ideal.

The technical task of identifying credits in poorly documented source code was complex, especially given the vast and changing nature of the code base. Credits are often not available and rarely follow a set format, so various heuristics have been applied and "policy" decisions made on, for example, how to divide credit among multiple listed authors. Details can be found in the documentation for CODD[1].

The code base itself was limited. Although far from being a complete set of all code ever released without payment on the Internet - our ideal, eventual goal - we believe we have used a fairly representative sample of software projects (released under the GNU General Public License and its variants) developed in recent years. The source code base for OFSS01 is:

* RedHat Linux v6.1 source rpms, including Linux kernel 2.2.14
* Munitions cryptography/security archive as of January 11, 2000 [http://munitions.vipul.net]
* Approximately 50% of source code available through Freshmeat as of January 5, 2000. Explanation: source code is not easily available for all projects on Freshmeat, at least when accessed through an automated script with simple intelligence. [http://freshmeat.net]

For each module or package analysed, source code is broken into projects identified according to the package distribution. Source code and some documentation files are scanned for authorship, credit or copyright information, from which author names are identified. Data collected includes, for each identified author, the number of bytes of code authored and the number and names of projects authored. From this, the degree of contribution in terms of bytes of code can be calculated for any given project.

Project data is collated to form a broader picture of authorship distribution, which can be examined at several levels. In this survey, only very basic analysis has been performed.
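
The scanning and crediting steps described above can be illustrated with a small sketch. This is not CODD's actual algorithm (see [1] for that): the regular expression, the equal split of bytes among listed authors, the "(uncredited)" bucket and all function names are assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for lines such as
#   "Copyright (C) 1999 Jane Doe" or "Copyright 1998, 2000 Example Corp."
# It is deliberately crude and will also match prose that merely
# contains the word "copyright".
COPYRIGHT_RE = re.compile(
    r"copyright[ \t]*(?:\(c\))?[ \t]*"
    r"(?:\d{4}(?:[ \t]*[-,][ \t]*\d{4})*)?[ \t,]*"
    r"(?P<name>[^\n<(]+)",
    re.IGNORECASE,
)


def find_authors(text):
    """Return normalised author names from copyright lines, in order of first appearance."""
    authors = []
    for match in COPYRIGHT_RE.finditer(text):
        # Strip punctuation and comment delimiters, then lower-case the name.
        name = match.group("name").strip(" .,\t*/#").lower()
        # Skip empty matches and matches with no letters (e.g. a bare year).
        if name and any(ch.isalpha() for ch in name) and name not in authors:
            authors.append(name)
    return authors


def credit_file(text, totals):
    """Add this file's size in bytes to totals, split equally among its authors.

    The equal split is an assumed policy, not the survey's documented rule.
    """
    size = len(text.encode("utf-8"))
    authors = find_authors(text) or ["(uncredited)"]
    share = size / len(authors)
    for author in authors:
        totals[author] += share


def credit_project(files):
    """Return a mapping of author name to credited bytes for a list of file contents."""
    totals = defaultdict(float)
    for text in files:
        credit_file(text, totals)
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "/* Copyright (C) 1999 Jane Doe */\nint main(void) { return 0; }\n",
        "# Copyright 1998, 2000 Example Corp.\n# Copyright (C) 2000 Jane Doe\nprint('hi')\n",
        "no credit here\n",
    ]
    for author, size in sorted(credit_project(sample).items()):
        print(f"{author}: {size:.1f} bytes")
```

Equal splitting is only one possible policy; a different rule for multiple listed authors would shift the per-author byte counts without changing the project total.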

The next survey will broaden the scope of analysis to include features such as the degree of cross-participation between projects and groups of authors. The next survey - planned for June - will also use a bigger code base. At the very least the code base will expand to include Sourceforge [http://sourceforge.net], OpenBSD [http://openbsd.org] and Perl CPAN libraries [http://cpan.org].

As the survey continues and becomes more frequent, we plan to track changes in the code base over time (including historical perspectives using older versions of, say, the Linux kernel) and monitor movement between projects and groups.

CONTEXT: ORBITEN

Orbiten Research is devoted to the practical understanding of Cooking-pot networks[2], the economic model for trans-monetary phenomena on the Internet. A special focus is on developing tools of measurement and generating data on the production, use and trade in free ("open source") software.

Modelling communities and economic activity usually depends on measurement, which is why it seems very hard to model cooking-pot networks - such as the community of free software developers. Orbiten plans to develop and use various methods of getting around the "problems" of cooking-pot networks, of modelling and understanding them so that their benefits can be truly appreciated and worked with. A summary of these methods can be found[3] on the Orbiten web site.

REFERENCES

[1] CODD documentation, Orbiten. http://orbiten.org/codd/
[2] "Cooking-pot markets" by Rishab Aiyer Ghosh, First Monday, Volume 3, Issue 3, March 1998. http://www.firstmonday.org/issues/issue3_3/ghosh/
[3] "Identifying, tracking and measuring activity in cooking-pot networks" by Rishab Aiyer Ghosh, Orbiten. http://orbiten.org/summary.html

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: majordomo@bbs.thing.net and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: nettime@bbs.thing.net