Michael Wojcik on Thu, 7 Jan 2010 11:28:51 +0100 (CET)



Re: <nettime> fast-changing propaganda website archiving tools?


Flick Harrison wrote:
> Hey nettimers,
> 
> I'm trying to archive some government propaganda websites for a
> research project.  I'm on Mac but could access Linux or PC tools in a
> pinch.
> 
> All the various things I've tried have failed to maintain the full
> interactivity / Flash linking within the kind of page I'm wanting.

It would help to know what you've tried, then. You mention "things
like Fink and Wget". What would those "things" include?

If you haven't tried HTTrack (WinHTTrack for Windows, WebHTTrack for
UNIX and Linux), I'd suggest that. It's free, open-source, and
reasonably easy to use, configure, and automate. I used WinHTTrack to
record changes to US presidential candidate websites in 2007-2008, for
a visual-rhetoric project, and it did the job.

http://www.httrack.com/
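
As a rough illustration of the automation mentioned above, here is a
minimal Python sketch that mirrors a site into a fresh timestamped
directory on each run, so successive copies can be compared later. It
assumes the command-line httrack binary is installed and on the PATH,
and it uses only the basic "httrack URL -O directory" form. The
function names, the archive directory layout, and the example URL are
hypothetical, not anything taken from my own setup.

```python
import datetime
import pathlib
import subprocess


def snapshot_command(url, archive_root, when=None):
    """Build an httrack command that mirrors url into a dated directory."""
    when = when or datetime.datetime.now()
    # One directory per run, e.g. archive/20100107-112851
    target = pathlib.Path(archive_root) / when.strftime("%Y%m%d-%H%M%S")
    return ["httrack", url, "-O", str(target)]


def take_snapshot(url, archive_root):
    """Run httrack once and return the directory it wrote to."""
    cmd = snapshot_command(url, archive_root)
    # check=True raises CalledProcessError if httrack exits with an error
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling take_snapshot("http://example.org/", "archive") from cron or
another scheduler would then give one mirror per run. Whether each
mirror preserves Flash linking still depends on the site.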

Note that in general, though, there are any number of ways that people
make websites difficult to successfully copy and archive. Basic
honor-system methods like robots.txt (which the Wayback Machine
respects, for example) and client sniffing are easy to bypass - you
just ignore or spoof them (and HTTrack has an option for that). But
techniques like traffic shaping, keying served content to ephemeral
session cookies, and scripts that inspect document URLs require
considerably more finesse.
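
To make the "honor system" point concrete, here is a small Python
sketch using only the standard library. It shows that a robots.txt
check happens entirely on the client side, that the User-Agent header
is simply whatever the client chooses to send, and that a cookie jar
is the starting point for content keyed to session cookies. The sample
robots.txt rules, the URLs, and the function names are made up for
illustration, and treating client sniffing as a User-Agent check is an
assumption about one common form of it.

```python
import http.cookiejar
import urllib.request
import urllib.robotparser

# Hypothetical rules, standing in for a site's real robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /press/
"""


def allowed_by_robots(robots_text, user_agent, url):
    """Report what robots.txt asks for; nothing enforces the answer."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, user_agent):
    """Prepare a request announcing whatever client name we choose."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def session_opener():
    """Return an opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

An archiver that skips the allowed_by_robots call, or passes a
browser's name to build_request, has bypassed both mechanisms. The
cookie jar only helps with the simplest session-keyed content; the
harder cases generally mean reproducing the request sequence a real
browser would make.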

While it's axiomatic that anything served can be saved, the work
factor for saving something can be made pretty high - often higher
than the content in question is worth to the person trying to save it.
(That's what security is all about, of course: making the work factor
for the attacker high enough to invert the economics of the attack,
without doing the same to the work factor for authorized parties.)

-- 
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University

#  distributed via <nettime>: no commercial use without permission
#  <nettime>  is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mail.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: nettime@kein.org