4 Nov 2002.
Daniel Hellerstein (danielh@crosslink.net)

                GrabSite: GET a set of linked documents from a WWW site

GrabSite version 1.20 is designed to copy a WWW site to your local hard disk. 

It's easy to use: basically, you just specify the URL of the home 
page you want to "grab", and then specify a destination directory 
(on your hard drive) into which the web pages should be copied.

GrabSite is freeware; please read the disclaimer at the bottom of this
document.


I) Installation:

To install GrabSite, just copy GRABSITE.CMD to your hard drive,
and then execute it from an OS/2 prompt.

For example, if you copied GRABSITE.CMD to D:\GRAB:
      D:\GRAB>grabsite
You will be presented with several (non-GUI) questions.
There is a smattering of on-line help -- just hit the ? key.

GRABSITE.CMD is a REXX file -- ambitious users can modify
the user-configurable parameters by editing (with your
favorite text editor) the user-changeable-parameters section
at the top of GRABSITE.CMD.

Note:
  GRABSITE uses the RxSock and RexxUtil dynamic link libraries (DLLs). 
  In almost all cases, these DLLs will already be on your machine
  (they are part of OS/2). If you don't have them, you can download them
  from http://www.srehttp.org/pubfiles

II) Description:

Basically, GrabSite works by:
  a) Initializing a "to retrieve" list with the URL (the "home page") that
     you requested.
Then, GrabSite works its way down the "to retrieve" list:
  b) Get the "top" entry in the "to retrieve" list.
  c) GET (using socket calls) this URL.
  d) Copy the contents (of what was just retrieved) to the destination
     directory.
  e) If it's a text/html document (as determined by examining
     the Content-type response header), parse the contents and extract
     "links", including <A> (anchor), <IMG> (image), <FRAME> (frame),
     and <MAP> (imagemap) links.
  f) Add these extracted links to the bottom of the "to retrieve" list.
  g) Discard this top entry, and if there is anything left in the
     "to retrieve" list, go back to step b.
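
The loop described above can be sketched in a few lines. This is only an
illustration in Python, not GrabSite's actual REXX code. The names
LinkExtractor, grab, and fetch are hypothetical; fetch stands in for the
socket GET and returns (content_type, body). The "seen" set is an assumption
added so the sketch does not revisit pages (this README does not say how
GrabSite handles duplicate links), and a dictionary stands in for the
destination directory. Imagemap links are read from the <AREA> tags found
inside <MAP>.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect link targets from <A>, <IMG>, <FRAME>, and <AREA> tags."""

    # Tag name -> attribute holding the link (HTMLParser lowercases both).
    LINK_ATTRS = {"a": "href", "img": "src", "frame": "src", "area": "href"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def grab(home_url, fetch):
    """Walk the "to retrieve" list and return {url: body} for each URL fetched.

    fetch(url) is a hypothetical callable returning (content_type, body).
    """
    to_retrieve = deque([home_url])   # initialize the list with the home page
    seen = {home_url}                 # assumption: never queue a URL twice
    saved = {}                        # stand-in for the destination directory
    while to_retrieve:
        url = to_retrieve.popleft()   # take (and discard) the top entry
        content_type, body = fetch(url)
        saved[url] = body
        # Only text/html documents are parsed for further links.
        if content_type.split(";")[0].strip().lower() == "text/html":
            parser = LinkExtractor()
            parser.feed(body)
            for link in parser.links:
                absolute = urldefrag(urljoin(url, link)).url
                if absolute not in seen:
                    seen.add(absolute)
                    to_retrieve.append(absolute)   # add to the bottom
    return saved


# A tiny in-memory "site" so the sketch runs without a network connection.
SITE = {
    "http://example.com/": ("text/html",
                            '<a href="a.html">A</a><img src="logo.gif">'),
    "http://example.com/a.html": ("text/html; charset=us-ascii",
                                  '<A HREF="/">home</A>'),
    "http://example.com/logo.gif": ("image/gif", ""),
}

pages = grab("http://example.com/", lambda url: SITE[url])
print(list(pages))
# ['http://example.com/', 'http://example.com/a.html', 'http://example.com/logo.gif']
```

Because new links go to the bottom of the list and entries are taken from
the top, pages are visited in breadth-first order: the home page first, then
everything it links to, and so on.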

In practice, there are a number of modifications possible to these steps.
For example:
  * GrabSite can skip retrieval of links (URIs) that point to a
    script (say, to a CGI-BIN script)
  * GrabSite can skip retrieval of links that are not in the directory, or
    a subdirectory, of the requested "home page"
  * GrabSite can retrieve, but not parse, links that are under a parent
    of the "home page" 
  * GrabSite can skip links that start with user-selectable strings (say, that 
    start with a !)
  * GrabSite can read a site's ROBOTS.TXT file and avoid specified links.
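
Several of these options amount to a yes/no test applied to each link before
it is added to the "to retrieve" list. The sketch below is an illustration in
Python, not GrabSite's actual REXX code. The function should_retrieve and its
parameters are hypothetical. It assumes that a "script" can be recognized by
/cgi-bin/ in the path, and that the user-selectable strings are compared
against the file-name part of the link; GrabSite's own rules may differ. The
"retrieve, but not parse" option is left out.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def should_retrieve(url, home_url, skip_prefixes=("!",), robots=None,
                    script_markers=("/cgi-bin/",)):
    """Return True if url passes every (hypothetical) filter."""
    target, home = urlsplit(url), urlsplit(home_url)
    path = target.path or "/"

    # Skip links that point to a script (assumed marker: /cgi-bin/).
    if any(marker in path.lower() for marker in script_markers):
        return False

    # Skip links that are not in the home page's directory or below it.
    home_dir = posixpath.dirname(home.path or "/").rstrip("/") + "/"
    if target.netloc != home.netloc or not path.startswith(home_dir):
        return False

    # Skip links whose file name starts with a user-selected string.
    if posixpath.basename(path).startswith(tuple(skip_prefixes)):
        return False

    # Honor robots.txt when a parsed copy is supplied.
    if robots is not None and not robots.can_fetch("*", url):
        return False

    return True


# Usage: robots.txt rules are given inline so no network access is needed.
home = "http://example.com/docs/index.html"
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /docs/private/"])

print(should_retrieve("http://example.com/docs/sub/page.html", home, robots=robots))   # True
print(should_retrieve("http://example.com/cgi-bin/count.cgi", home, robots=robots))    # False
print(should_retrieve("http://example.com/docs/private/x.html", home, robots=robots))  # False
```

Keeping each rule as a separate early return makes it easy to switch
individual filters on or off, much as the configuration questions do.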
 
For further details, run GRABSITE and answer Y to the
   Would you like to modify configuration parameters?
question; and see the on-line help.
Or, better yet, read the top of GRABSITE.CMD!



                        -------------------------
Disclaimer:

   GrabSite is freeware that is to be used at your own risk -- the 
   author and any potentially affiliated institutions disclaim all 
   responsibilities for any consequence arising from the use, misuse, or abuse
   of this software (or pieces of this software).

   You may use this (or subsets of this) program as you see fit,
   including for commercial purposes, so long as proper attribution
   is made, and so long as such use does not in any way preclude
   others from making use of this code.

                        -------------------------


