Page Scanner
ZebraTester's Page Scanner function automatically and recursively browses and explores the web pages of a web server, similar to a web spider or web crawler.
Page Scanner's Purpose
Primary: To turn a "normal" web surfing session into a load test program. This provides a simplified way to create a web surfing session instead of recording single web pages manually.
However, Page Scanner can only be used to acquire web surfing sessions that do not require HTML form-based authentication. This tool is not a replacement for recording web surfing sessions of real web applications.
Other: Page Scanner allows the detection of broken links inside a website and provides statistical data about the largest and slowest web pages. It also supports searching for text fragments across all scanned web pages.
Note 1: Page Scanner does not interpret JavaScript code and does not submit forms. Only hyperlinks are considered. Cookies are automatically supported.
Note 2: Page Scanner keeps the entire scanned website in its transient memory (RAM) in compressed form. This means that large websites can be scanned, but it also means that transient memory is not unlimited.
Please note that the Page Scanner tool may return no result, or an incomplete result, because some websites or web pages contain malformed HTML code or use old, unusual HTML options. Although this tool has been intensively tested, we cannot provide any warranty for error-free behavior. Errors related to a particular website or web page may be impossible to fix because of divergent requirements or complexity. The functionality and behavior are similar to those of search engines, which have similar restrictions.
GUI Display
The window is divided into two parts.
Scan Result: The upper part of the window shows the scan's progress, or the scan results once the scan has been completed.
Page Scanner Input Parameter: The lower part of the window allows you to enter the scan input parameters and start a scan.
Page Scanner Parameter Inputs
Starting Web Page
The scan starts from this URL. Optionally, scan only part of a website by entering a deep-linked URL path; for example, http://www.example.com/sales/customers.html. In this case, only web pages at or below the level of this URL path are scanned.
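For example, with that starting URL, a page such as http://www.example.com/sales/brochure.html would be included in the scan, while http://www.example.com/support/index.html would not (illustrative URLs).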
Char Encoding
The default value, Auto Detect, can be overridden in case some or all web pages are wrongly encoded, such that the character set specified in the HTML header does not match the character set actually used within the HTML body of the web pages (malformed HTML on the server side). You can try ISO-8859-1 or UTF as a workaround if Page Scanner cannot extract hyperlinks (succeeding web pages) from the starting web page.
Exclude Path Patterns
Excludes one or more URL path patterns from scanning. Separate multiple path patterns with commas.
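Example (illustrative values): entering /print/, /archive/ excludes all URLs whose paths contain either of these patterns.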
Follow Web Servers
Includes content and web pages from other web servers in the scan; for example, images embedded in the web pages but located on another web server. Enter additional web servers separated by commas, for example: http://www.example.com, https://imgsrv.example.com:444. The protocol (HTTP or HTTPS), the hostname (usually www), the domain, and the TCP/IP port are considered, but URL paths are NOT considered.
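For example, given the servers above, an embedded image at https://imgsrv.example.com:444/img/logo.png would be fetched because the protocol, hostname, and port match the second entry, whereas https://imgsrv.example.com/img/logo.png (default port 443) would not, because the port differs (illustrative URLs).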
Verify External Links
Verify all external links to all other web servers. This is commonly used to detect broken hyperlinks to other web servers.
Include
Determines which sets of embedded content types are also included in the scan. Page Scanner uses the file extension of the URL path to determine the content type (if available), because this can be done before the hyperlink of the embedded content itself is processed. This saves execution time, but it also means that a few URLs of excluded content types may flow into the scan result, because the MIME type of the received HTTP response headers is only used to detect web pages. You can remove these unwanted URLs after the scan has been completed by using the "Remove URLs" form in the Display Result window.
Content-Type Sets and Corresponding File Extensions:
Images, Flash, CSS, JS: .img, .bmp, .gif, .pct, .pict, .png, .jpg, .jpeg, .tif, .tiff, .tga, .ico, .swf, .stream, .css, .stylesheet, .js, .javascript
PDF Documents: .pdf
Office Documents: .doc, .ppt, .pps, .xls, .mdb, .wmf, .rtf, .wri, .vsd, .rtx
ASCII Text Files: .txt, .text, .log, .asc, .ascii, .cvs
Music and Movies: .mp2, .mp3, .mpg, .mpeg, .avi, .wav, .mov, .wm, .rm
Binary Files: .exe, .msi, .dll, .bat, .com, .pif, .dat, .bin, .vcd, .sav
Include Options
Allows you to select or de-select specific file extensions using the keywords -add or -remove.
Example:
-remove .gif -add .mp2
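In this example, .gif is removed from the default Images set, so GIF images are skipped, while .mp2 is added, so MP2 audio files are included in the scan.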
Max Scan Time
Limits the maximum scan time in minutes. The scan will be stopped if this time is exceeded.
Max Web Pages
Limits the maximum number of scanned web pages. The scan will be stopped if the maximum number of web pages is exceeded.
Max Received Bytes
Limits the maximum size of the received data (in megabytes), measured over the entire scan. The scan will be stopped if the maximum size of the received data is exceeded.
Max URL Calls
Limits the maximum number of executed URL calls, measured over the entire scan. The scan will be stopped if the maximum number of executed URL calls is exceeded.
URL Timeout
Defines the response timeout, in seconds, per single URL call. If this timeout expires, the URL call will be reported as failed (no response from the web server).
Max Path Depth
Limits the maximum URL path depth of scanned web pages.
Example: http://www.example.com/docs/content/about.html has a path depth of 3.
Follow Redirections
Limits the total number of followed HTTP redirects during the scan.
Follow Path Repetitions
Limits the number of path repetitions that can occur within a single URL path. This parameter acts as protection against endless loops in scanning and should usually be set to 1 (default) or 2.
Example: http://www.example.com/docs/docs/docs/about.html has a path repetition value of 3, because the path element docs occurs three times within the URL path.
Follow CGI Parameters
This option (disabled by default) acts as protection against receiving almost identical URLs many times when they differ only in their CGI parameters. If the option is disabled, only the first such URL will be processed.
For example, the first URL http://www.example.com/showDoc?context=12 will be processed, but the subsequent similar URLs http://www.example.com/showDoc?context=10 and http://www.example.com/showDoc?context=13 will not be processed.
Browser Language
Sets which default language should be preferred when scanning multilingual websites.
Use Proxy
Applies the Next Proxy Configuration from the Personal Settings menu when scanning through an (outgoing) proxy server.
SSL Version
Select the SSL protocol version to communicate with HTTPS servers (encrypted connections).
Annotation
Enter a short comment about the scan.
Authentication
Allows scanning protected websites (or web pages).
Authentication Methods:
Basic: Applies HTTP Basic Authentication (a Base64-encoded username:password pair sent within all HTTP request headers). Enter a username and password into the corresponding input fields.
NTLM: Applies NTLM authentication for all URL calls (if requested by the web server). The NTLM configuration from the Personal Settings menu will be used.
PKCS#12 Client Certificate: Applies an HTTPS/SSL client certificate for authentication. The active PKCS#12 client certificate from the Personal Settings menu will be used.
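As a point of reference, the sketch below shows how an HTTP Basic Authentication header value is typically formed from a username and password; the credential values are purely illustrative and not part of ZebraTester itself.

```python
import base64

# HTTP Basic Authentication: the header carries "Basic " followed by the
# Base64 encoding of "username:password". The credentials below are illustrative.
username = "alice"
password = "secret"
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

print(f"Authorization: Basic {token}")  # e.g. Authorization: Basic YWxpY2U6c2VjcmV0
```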
Scan Options
ABORT: You can abort a running scan by clicking the "Abort Scan" (X) icon.
DISPLAY: Display the scan result
CONVERT: Converts the Page Scanner result into a "normal" web surfing session (.prxdat), creating a load test program for additional ZebraTester actions.
A filename, without path or file extension, is required.
An annotation is recommended to provide a hint in Project Navigator.
Click Convert and Save when ready.
Optionally display the newly converted session in the Main Menu.
Filename
The filename of the web surfing session. You must enter a "simple" filename with no path and no file extension. The file extension is always .prxdat. The file will be saved in the selected Project Navigator directory.
Web Pages
Selects the scanned web pages which should flow into the web surfing session. "All Pages" means that all scanned web pages are included. Alternatively, the "Page Ranges" option allows you to select one or several ranges of page numbers. Separate multiple ranges with commas.
Example: "1, 3-5, 7, 38-81"
Max URL Calls
Limits the number of URL calls that should flow into the web surfing session. Tip: Apica recommends not converting more than 1,000 URL calls into a web surfing session.
Annotation
Enter a short comment about the web surfing session. This will become a hint in Project Navigator.
Load Session into
Optionally loads the web surfing session into the transient memory area of the Main Menu or into one of the two memory Scratch Areas of the Session Cutter.
SAVE: When a scan has been completed, the scan result can be saved to a file. The file will be saved in the selected Project Navigator directory and will always have the file extension .prxscn. Scan results can be restored and loaded back into the Page Scanner by clicking the corresponding "Load Page Scan" icon inside Project Navigator.
DISCARD: Discards the scan results.
Analyzing the Scan Result
The most important statistical data about the scan are shown in the summary/overview near the top of the window. Below the overview, you can select which scan result details to display or filter.
On the right side, near the scan result detail selection, a search form allows you to search for an ASCII text fragment across all web pages of the scan result.
By default, the text fragment is searched for within all HTTP request headers, all HTTP response headers, and all HTTP response content data.
The Remove URLs form, shown below the scan result detail selection, allows you to remove specific URLs from the scan result. The set of URLs to remove is selected by the received MIME type (examples: IMAGE/GIF, APPLICATION/PDF, ...), combined by a logical AND condition with the received HTTP status code of the URLs (200, 302, ...) or with a Page Scanner error code such as "network connection failed".
with content MIME-type selects a specific MIME type. The input field is case insensitive (upper and lower case characters will be processed as identical).
any means that all MIME types are selected, independent of their value.
none means that only URL calls whose HTTP response headers do NOT contain MIME type information (HTTP response header field "Content-Type" not set) will be selected.
HTTP status code selects an HTTP status code or a Page Scanner error code.
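For example, selecting the MIME type APPLICATION/PDF together with the HTTP status code any would remove all PDF documents from the scan result (illustrative selection).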
Note: A few URLs with excluded content types may flow into the scan result (not selected by the scan input parameters). You can use the Remove URLs form to clean up the scan result and remove unwanted URLs. The most common case is removing PDF documents from the scan result.
The Scan Input Parameter displays all input parameters for the scan (without authentication data).
Scan Statistic displays some additional statistical data about the scan.
Similar Web Pages shows the number of web pages with duplicate content (same content but a different URL path). Failed URL Calls shows the number of URL calls that failed, either because no HTTP status code was available (no response received from the web server) or because the received HTTP status was an error code (400..599).
Non-Processed Web Servers displays a summary of all web servers that were found in hyperlinks but whose web pages or page elements have not been scanned.
The number before the server name shows how many times Page Scanner ignored a hyperlink to that server.
Scan Result per Web Page displays all scanned web pages. A web page's embedded content, such as images, is always displayed in a web browser cached view. This can mean, for example, that a particular (unique) image is shown only once, inside the web page in which it was referenced for the first time; all subsequent web pages will not show the same embedded content again. This behavior is roughly equivalent to what a web browser does: it caches duplicate references across all web pages within a web surfing session.
Broken Links displays a list of all broken hyperlinks.
Duplicated Content displays a list of URLs with duplicate content (same content but different URL path).
Largest Web Pages displays a list of the largest web pages.
Tip: Click on any of the bars to see the Scan Result per Web Page details.
Slowest Web Pages displays a list of the slowest web pages.