4 Nov 2002. Daniel Hellerstein (danielh@crosslink.net)

GrabSite: GET a set of linked documents from a WWW site

GrabSite version 1.20 is designed to copy a WWW site to your local hard disk. It's easy to use: you specify the URL of the home page you want to "grab", and then specify a destination directory (on your hard drive) into which the web pages should be copied.

GrabSite is freeware; please read the disclaimer at the bottom of this document.

I) Installation:

To install GrabSite, just copy GRABSITE.CMD to your hard drive, and then execute it from an OS/2 prompt. For example, if you copied GRABSITE.CMD to D:\GRAB:

    D:\GRAB>grabsite

You will be presented with several (non-GUI) questions. There is a smattering of on-line help -- just hit the ? key.

GRABSITE.CMD is a REXX file -- ambitious users can modify the user-configurable parameters by editing (with any text editor) the user-changeable-parameters section at the top of GRABSITE.CMD.

Note: GRABSITE uses the RxSock and RexxUtil dynamic link libraries (DLLs). In almost all cases, these DLLs will already be on your machine (they are part of OS/2). If you don't have them, you can download them from http://www.srehttp.org/pubfiles

II) Description:

Basically, GrabSite works by:

  a) Initializing a "to retrieve" list with the URL of the "home page" that you requested.

Then, GrabSite works its way down the "to retrieve" list:

  b) Get the "top" entry in the "to retrieve" list.
  c) GET (using socket calls) this URL.
  d) Copy the contents (of what was just retrieved) to the destination directory.
  e) If it's a text/html document (as determined by examining the Content-type response header), parse the contents and extract "links", including anchor, image, frame, and imagemap links.
  f) Add these extracted links to the bottom of the "to retrieve" list.
  g) Discard this top entry and, if there is anything left in the "to retrieve" list, go back to step b.
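
The retrieval loop described above can be sketched roughly as follows. This is an illustration in Python, not GrabSite's actual REXX code. The names grab, fetch, save, and LinkExtractor are hypothetical, the tag/attribute table is an assumption about which HTML elements carry the four link types, and the "seen" set (which keeps a URL from being queued twice) is an added assumption that the description above does not mention.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Assumed mapping of tag -> attribute for anchor, image, frame, imagemap links.
LINK_ATTRS = {"a": "href", "img": "src", "frame": "src", "area": "href"}


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the tags listed in LINK_ATTRS."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def grab(home_url, fetch, save):
    """Walk the "to retrieve" list, returning the URLs retrieved, in order.

    fetch(url) -> (content_type, body_bytes) and save(url, body_bytes) are
    hypothetical callables supplied by the caller.
    """
    to_retrieve = deque([home_url])   # initialize the list with the home page
    seen = {home_url}                 # assumption: never queue a URL twice
    retrieved = []
    while to_retrieve:
        url = to_retrieve[0]                  # the "top" entry
        content_type, body = fetch(url)       # GET this URL
        save(url, body)                       # copy to the destination
        retrieved.append(url)
        # Only text/html documents are parsed for further links.
        if content_type.split(";")[0].strip().lower() == "text/html":
            parser = LinkExtractor(url)
            parser.feed(body.decode("latin-1"))
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    to_retrieve.append(link)  # add to the bottom of the list
        to_retrieve.popleft()                 # discard the top entry
    return retrieved
```

Because fetch and save are passed in, the loop can be tried against an in-memory dictionary of pages before pointing it at a real server. Appending to the bottom of the list and taking from the top gives a breadth-first walk of the site.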

In practice, there are a number of modifications possible to these steps. For example:

  * GrabSite can skip retrieval of links (URIs) that point to a script (say, to a CGI-BIN script)
  * GrabSite can skip retrieval of links that are not in the directory, or a subdirectory, of the requested "home page"
  * GrabSite can retrieve, but not parse, links that are under a parent of the "home page"
  * GrabSite can skip links that start with user-selectable strings (say, that start with a !)
  * GrabSite can read a site's ROBOTS.TXT file and avoid specified links.

For further details, run GRABSITE and answer Y to the "Would you like to modify configuration parameters?" question, and see the on-line help. Or, better yet, read the top of GRABSITE.CMD!

-------------------------
Disclaimer:

GrabSite is freeware that is to be used at your own risk -- the author and any potentially affiliated institutions disclaim all responsibilities for any consequence arising from the use, misuse, or abuse of this software (or pieces of this software).

You may use this (or subsets of this) program as you see fit, including for commercial purposes, so long as proper attribution is made, and so long as such use does not in any way preclude others from making use of this code.
-------------------------