18 December 2002

Contact: Daniel Hellerstein (danielh@crosslink.net)

     CheckLink ver 1.20: Create, display, traverse, and index a web-tree

Abstract:
   CheckLink is a multi-threaded, socket-aware utility used to create,
   verify, traverse, and index a web-tree, where "web-tree" is defined as
   all URLs (in-line images, anchors, etc.) that are referenced in a chosen
   HTML document, and in documents reachable from this document.

   CheckLink can be run as an SREhttp/2 addon, or from an OS/2 command
   prompt.

                        -------------------

Contents:

   I. Introduction
      I.a. Quick Start
      I.b. Web Tree? Does that make sense?
   II. Installation
      II.a. Installing as an SREhttp/2 addon
      II.b. Using CheckLink as a standalone program
   III. CheckLink parameters
      III.a. A Note on How CHEKLINK displays results
      III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters
   IV. CHEKLINK Options -- Create a Web Tree
   V. CHEKLNK2 Options -- Display and Traverse a Web Tree
   VI. CHEKINDX -- Create an Index of a Web Site
      VI.a. CHEKINDX Options
      VI.b. CHEKINDX edit mode
   VII. CHEKRPT report writer -- Report information about a Web Tree
   VIII. CHEKFIX "fix" busted URLs -- Note busted links in files that
         contain them
   IX. Notes
   X. Disclaimer

                        -------------------

I. Introduction

CheckLink is a robot that is used to create, verify, traverse and index a
web-tree. In other words, CheckLink will find and variously display all the
URLs (such as anchors and in-line images) that appear in a set of HTML
documents.

In particular, CheckLink will:

   ... given a "Starter-URL" provided by a client:

      a) use TCP/IP socket calls to obtain the contents of the HTML
         document (that this "Starter-URL" points to).
         Alternatively, in standalone mode you can use the
         FILE:///filename.ext syntax to read & process a file on your
         hard disk.
      b) find URLs referred to by this document (i.e.,

      cd \internet\cheklink
      D:\INTERNET\CHEKLINK>cheklink

When run in standalone mode, the i/o interface is non-graphical -- you are
asked to either provide an input file (load_options), or you can enter
several parameters from the keyboard. Once CheckLink starts to run, status
information (such as what resources are being obtained) is dynamically
updated in a text window -- it's actually pretty informative, even though
it's all text based!

If you are running an OS/2 web server that understands CGI-BIN (most of
them do), then you should copy the CHEKLNK2.CMD and CHEKINDX.CMD files to
your CGI-BIN scripts directory. The output from CHEKLINK can be instructed
to include appropriate calls to CHEKLNK2. In addition, you can use the
CHEK_SRE.HTM "front end" to invoke both of these utilities.

Thus, to use CheckLink in a non-SREhttp/2 environment, you will:

   a) Run CHEKLINK.CMD, from an OS/2 command prompt, to generate the index
      of a web-tree, and to produce several tables of results.
      BE SURE TO SAY Yes when asked:
         "Use CGI-BIN to specify CHEKLNK2 (web traversal) links?"

   b) Invoke CHEKLNK2.CMD and CHEKINDX.CMD as CGI-BIN scripts.
      One way to do this is to ...
         Invoke CHEKLNK2.CMD or CHEKINDX.CMD from CHEK_SRE.HTM -- you'll
         need to make a few simple modifications to CHEK_SRE.HTM (see
         CHEK_SRE.HTM for the details).

Alternatively, you can run CHEKRPT.CMD as a standalone program. CHEKRPT is
not quite as powerful as CHEKLNK2, but it does have a number of nice report
writing features, and the HTML documents it produces give you a limited
amount of "web tree traversal" opportunities.

                        -------------------

III. CheckLink parameters.

Regardless of how you run CheckLink, you may wish to first adjust several
performance-tuning and display-customization parameters. Most of these
appear at the top of CHEKLINK.CMD, and there are a few in CHEKLNK2.CMD,
CHEKRPT.CMD, and CHEKINDX.CMD -- you should modify these files with your
favorite text editor.
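
As a rough illustration of what such an edit might look like, here is a
hypothetical excerpt of a parameter block near the top of CHEKLINK.CMD. The
parameter names and values are simply the examples given in section III.b;
the ordering, the comments, and the exact layout are assumptions, and the
shipped file may well look different.

```rexx
/* Hypothetical excerpt of the user-changeable parameters in CHEKLINK.CMD. */
/* Layout is assumed; the values are the examples from section III.b.      */

LINKFILE_DIR='D:\GOSERVE\CHKLNKS'  /* directory for linkage and results files  */
CHECK_ROBOT=1                      /* 1 = check /robots.txt on the starter site */
DOUBLE_CHECK=1                     /* 1 = re-check "inaccessible servers"      */
GET_QUERY=0                        /* 0 = query with HEAD requests             */
MAXATONCE=6                        /* max number of "query" (HEAD) threads     */
MAXATONCE_GET=2                    /* max number of "read" (GET) threads       */
MAXAGE=30                          /* seconds to wait on a query               */
MAXAGE2=60                         /* seconds to wait on a read                */
PROXY_SERVER=0                     /* 0 = do not use a proxy server            */
```

Each line is an ordinary REXX assignment, so a stray quote or an unclosed
comment can stop the .CMD file from running. Changing one value at a time is
probably the easiest way to track down a mistake.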
Note that you do NOT need to set these parameters to use any of the
CheckLink programs -- the default values work reasonably well. However, if
you intend to make more than occasional use of CheckLink, and you are
running CheckLink as a standalone program, we recommend setting the
LINKFILE_DIR parameter in CHEKLINK.CMD, CHEKLNK2.CMD, CHEKRPT.CMD, and
CHEKINDX.CMD.

                        -------------------

III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters

BACK_1 : modifiers.
     BACK_1 is used to set a BGCOLOR (or BACKGROUND) for CheckLink's
     output.
     Example:
        BACK_1='bgcolor="#668a78"'

CHEKLINK_HTM : URL pointing to CHEK_SRE.HTM
     CHEKLINK_HTM should contain a URL (usually, a relative URL) that
     points to the CHEK_SRE.HTM file shipped with CheckLink. This variable
     is used to add a "generate another web-tree" option to the output
     file. Thus, neglecting to properly set CHEKLINK_HTM will have minimal
     deleterious effects.
     Example:
        CHEKLINK_HTM = '/SRE2K/SREHTTP2/APPS/CHEKLINK/CHEK_SRE.HTM'

CHECK_ROBOT : Check (or suppress checking of) ROBOTS.TXT.
     If CHECK_ROBOT=1, then check the "Starter-URL" site for a /robots.txt
     file, and use it to control the extent of the search. Proper
     netiquette dictates that you set CHECK_ROBOT=1 when checking a
     stranger's site.
     Note: the contents of a ROBOTS.TXT file are added to the special
           "site-specific" EXCLUSION_LIST -- it only affects URLs on the
           "Starter-URL" site.
     Example:
        CHECK_ROBOT=1

DOUBLE_CHECK: Since servers can be momentarily busy, it's often wise to
     "double check" busy servers.
        DOUBLE_CHECK=0 : do NOT double check
        DOUBLE_CHECK=1 : double check "inaccessible servers"
        DOUBLE_CHECK=2 : double check "inaccessible servers" AND
                         "missing resources"
     Double checking will occur after all links have been examined (thus
     giving the "not available" server a chance to become available).
     Lastly, GET queries are used (instead of HEAD queries).
     However, HTML documents retrieved via a double check will NOT be
     "recursively processed", even if they otherwise would have been (that
     is, had they not required this double check).

GET_QUERY: As part of mapping a web-tree, CheckLink will query servers for
     basic information on URLs. These queries are best done with HEAD
     requests. Unfortunately, there are a number of older servers that do
     not properly respond to HEAD requests. If you find that CheckLink is
     identifying many URLs as unavailable (even though your browser can get
     to them readily), it may be due to their host server's failure to
     recognize these HEAD requests.
     As a workaround, you can use short GET requests instead of HEAD
     requests. This method is engaged by setting GET_QUERY=1.
     Example:
        GET_QUERY=0
     Note: The GET_QUERY=1 method is not highly recommended -- it's slower,
           and somewhat "ruder" (connections are purposely broken, which
           tends to add garbage to the visited server's log file).
           Instead, we recommend setting DOUBLE_CHECK=1.

LINKFILE_DIR: directory to store "linkage" files in.
     Linkage files contain "link" information on all the URLs discovered
     during CheckLink's recursive mapping of a "web tree". In particular,
     the LINKFILE option (see section IV) specifies a filename, and that
     file is then stored in the LINKFILE_DIR.
     By default, LINKFILE_DIR will be your OS/2 TEMP drive.
     Example:
        LINKFILE_DIR='D:\GOSERVE\CHKLNKS'
     Note: in addition to storing LINKFILEs, the LINKFILE_DIR is also used
           to store "RESULTS" files.

MAXATONCE: maximum number of "query" threads
     Specifies the maximum number of threads to use when checking for the
     existence (and mimetype) of a link (using HEAD requests). Increasing
     this number may speed up throughput, but it may subject the target
     server(s) to excessive loads.
     Example:
        MAXATONCE=6

MAXATONCE_GET: maximum number of "read" threads.
     Specifies the maximum number of threads to use when retrieving the
     contents of a URL (using GET requests).
     Increasing this number may speed up throughput, but it may subject the
     target server(s) to excessive loads.
     Example:
        MAXATONCE_GET=2

MAXAGE: Kill a query if it's old
     Specifies the number of seconds to wait on a query (a HEAD request).
     You may need to increase this time span if sites are far away or
     otherwise slow. However, increasing MAXAGE will increase the time that
     CheckLink waits on "hung" sites.
     Example:
        MAXAGE=30

MAXAGE2: Kill a read if it's old
     Specifies the number of seconds to wait on a read (a GET request).
     You may need to increase this time span if sites are far away or
     otherwise slow. However, increasing MAXAGE2 will increase the time
     that CheckLink waits on "hung" sites.
     Example:
        MAXAGE2=60

PROXY_SERVER: Specify a proxy server to route requests through
     The proxy server to send http requests through. Use an IP name or
     numeric address, with an optional port. If you are NOT using a proxy
     server, set this to 0.
     Examples:
        PROXY_SERVER='voxy.mycompany.com:8080'
        PROXY_SERVER=0

ROW_COLOR1 : Used to set the row colors in the results tables
ROW_COLOR2
ROW_COLOR1A
ROW_COLOR2A
     ROW_COLOR1 and ROW_COLOR2 set the odd and even rows (respectively) of
     tables used to display the results of checking IMG links.
     ROW_COLOR1A and ROW_COLOR2A set the odd and even rows (respectively)
     of tables used to display the results of checking Anchor links.
     Examples:
        ROW_COLOR1='bgcolor="#bbcc66"'
        ROW_COLOR2='bgcolor="#aaccdd"'
        ROW_COLOR1A='bgcolor="#bbaa44"'
        ROW_COLOR2A='bgcolor="#aaccdd"'

REMOVE_SCRIPT: Remove