---------------------------------------------------------------
DSA Release 1.0
User Guide
- by -
R.M. Thomas
May 1995
---------------------------------------------------------------
CONTENTS
1. INTRODUCTION
2. HOW TO INSTALL DSA FOR DOS/OS2
3. PROGRAM DESIGN
4. BASIC COMMAND LINE USAGE
5. TRUNCATION
6. PHRASE STATISTICS
7. SOME OTHER COMMAND LINE SWITCHES
8. MEMORY-MANAGEMENT SWITCHES
9. IMPORT OF DATA
10. SAVE AND UNSAVE: UNFORMATTED TEMPORARY FILES
11. OTHER OUTPUT FILES
12. SUMMARY OF SYNTAX
---------------------------------------------------------------
1. INTRODUCTION
DSA is a program which performs statistical analyses of the
frequency of words and phrases in machine-readable text. The
user specifies the required type of analysis and the form of
the output by means of command-line parameters and/or indirect
command files. By default, truncation of words is performed
using inbuilt stemming algorithms (presently two alternatives
for English, plus one each for German and French). Each
grammatical sentence can optionally be treated as (overlapping)
sets of up to five adjacent words; the statistical analysis is
then performed on the resulting phrases rather than the
individual words.
A special feature of DSA is that it allows efficient analysis
of the statistical differences in word frequency between two
files. The user can specify that words in the first file are
to be read and stored as a dynamic stopword list; when the
second file is read, statistical analysis is done only on those
words not appearing (or rarely appearing) in the first file.
There is no limit on the size of the first file which defines
the stopword set; moreover, by means of an indirect command
file it is possible to specify a set of input files, rather
than a single file, as the source of stopwords. This
capability, combined with the corresponding possibility of
replacing the second single file by a list of filenames, permits
flexible and sophisticated differential statistical analyses of
multiple texts. To facilitate optimisation of run-time
performance on various hardware platforms, there are
command-line options permitting the user considerable control
over both memory allocation and the deployment of temporary
files for intermediate analysis steps.
DSA is intended as an instrument for researching the
effectiveness of approaches to information retrieval based on
word-frequency statistics. Published literature suggests that
statistical methods are sometimes successful, but quantitative
confirmation is often lacking.
The program is currently supplied as executables for OS2, for
DOS with conventional memory only, and for DOS making use
of the GNU 32-bit DOS extender. An SCO-UNIX executable is
available on request. All source code is included in the
software distribution.
2. HOW TO INSTALL DSA FOR DOS/OS2
Copy all files, including all subdirectories, from the
installation diskette to a suitably named directory on the
hard disk. For example, under OS2:
D:
mkdir dsa
cd dsa
xcopy a:\*.* . /s /e /v
The directory D:\DSA may then be added to the search path by
modifying AUTOEXEC.BAT or STARTUP.CMD as appropriate.
3. PROGRAM DESIGN
The basic design of DSA is illustrated in Figure 1. There are
three data structures called indexes, which may be regarded as
alphabetically ordered lists of words; each of these words is
accompanied by a number which represents its frequency, that
is, the number of times that this word has occurred in the input
text since the index was initialised. For reasons which are
unimportant here, the three indexes are referred to as Tree 0,
Tree 1 and Tree 2. Initially it is helpful to think of Tree
0 and Tree 1 as being of approximately the same size, whereas
Tree 2 is much smaller. Tree 2 has only one purpose: to
hold a small list of conventional stopwords such as "the",
"but" and so on. Tree 1 has two primary functions: first,
to hold a (possibly very large) list of stopwords taken from
a text file or set of files; and, secondly, to serve as the
temporary working storage space which is required for the step
of generating output in frequency-ranked format: that is, with
the most commonly occurring words at the top of the list.
Tree 0 holds the words taken from the file, or set of files,
for which an analysis is currently in progress. New words are
admitted to Tree 0 only after checking that they do not occur
(or occur only rarely) in either Tree 1 or Tree 2.
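The admission rule described above can be sketched in Python as
follows. This is only an illustration under stated assumptions:
the function name admit and the use of collections.Counter to
stand for the trees are hypothetical and are not taken from the
DSA source code, and both truncation and the frequency thresholds
of Section 7 are ignored.

```python
from collections import Counter

def admit(word, tree0, tree1, tree2):
    """Count `word` in Tree 0 unless it is a stopword in Tree 1 or Tree 2.

    Returns True if the word was admitted. Default behaviour only:
    any occurrence in Tree 1 or Tree 2 blocks admission.
    """
    word = word.upper()
    if word in tree1 or word in tree2:
        return False
    tree0[word] += 1
    return True

# Hypothetical data: Tree 2 holds conventional stopwords,
# Tree 1 holds words read from an exclusion file.
tree2 = Counter({"THE": 1, "BUT": 1})
tree1 = Counter({"FOX": 3})
tree0 = Counter()

for w in "the quick brown fox but the quick dog".split():
    admit(w, tree0, tree1, tree2)

# Alphabetical listing of Tree 0 as (frequency, word) pairs.
listing = [(tree0[w], w) for w in sorted(tree0)]
```

Here listing holds BROWN and DOG once each and QUICK twice; "the",
"but" and "fox" never reach Tree 0.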
4. BASIC COMMAND LINE USAGE
Figure 1 shows a simple way of using DSA. First, a short list
of stopwords is imported from file STOP.EN into Tree 2. A file
EXCLUDE.1 is then read, its individual words being entered into
Tree 1 (provided they do not already appear in Tree 2). Finally
the subject file ANALYSE.1 is read into Tree 0, words found in
Tree 1 or Tree 2 being excluded; when the end of the file is
reached, the contents of Tree 0 are written out in alphabetical
order to file ALP and in frequency-ranked format to file FRE.
The basic command line for achieving this sequence of
operations is as follows:
dsa !EXCLUDE.1 ANALYSE.1 #ALP ##FRE
The exclamation mark prefixing the first filename specifies that
the text is to be read into Tree 1; in the absence of such a
mark, the contents of a file are read into Tree 0. It is not
possible to associate an input filename with Tree 2: its
stopwords are always read from a file named STOP.EN, STOP.DE or
STOP.FR; by default, DSA creates a suitable stopword file with
one of these names if the file does not already exist.
The hash mark prefixing the output filename ALP indicates that a
simple alphabetical listing of words is required. This
operation does not disturb the data in any of the three trees.
The double hash mark prefixing the output filename FRE calls
for a frequency-ranked output of the contents of Tree 0, and it
is important for the user to understand that, by default, this
destroys any existing data in Tree 1. In the present example,
this loss of Tree 1 is of no importance, since there is no
subsequent operation referencing it; in complex sequences of
statistical analyses, however, it may be necessary to save the
data in Tree 1 before specifying ##FRE, as will be discussed
below.
Several input files for both Tree 0 and Tree 1 may appear on
the command line:
dsa !EXCLUDE.1 !EXCLUDE.2 ANALYSE.1 ANALYSE.2 #ALP ##FRE
In such a case the files EXCLUDE.1 and EXCLUDE.2 are read
sequentially into Tree 1, that is, as if the two files had
been concatenated before being read. Similarly, the files
ANALYSE.1 and ANALYSE.2 are read into Tree 0, excluding words
appearing in either EXCLUDE.1 or EXCLUDE.2. Notice that the
order in which filenames appear on the command line is highly
significant. Thus the command
dsa !EXCLUDE.1 ANALYSE.1 !EXCLUDE.2 ANALYSE.2 #ALP ##FRE
is different from the previous command: this time, Tree 0 will
eventually hold words from ANALYSE.1 which do not occur in file
EXCLUDE.1, together with words from ANALYSE.2 which occur in
neither EXCLUDE.1 nor EXCLUDE.2. Similarly, the following
command
dsa !EXCLUDE.1 ANALYSE.1 #ALP !EXCLUDE.2 ANALYSE.2 ##FRE
is valid, but has a different meaning: the file ALP will now
contain an alphabetical list of words from ANALYSE.1 not
occurring in EXCLUDE.1, while the contents of the output file
FRE will be the same as for the previous command above.
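The significance of argument order can be illustrated with a small
Python model that processes the arguments strictly from left to
right. It is a sketch, not the DSA implementation: the function
run, the in-memory files dictionary and the sample texts are all
hypothetical, Tree 2 and truncation are omitted, and the
frequency-ranked output lists every word (DSA itself omits
singletons by default).

```python
from collections import Counter

def run(args, files):
    """Process DSA-style arguments strictly from left to right.

    `files` maps hypothetical input filenames to their text.
    Returns a dict mapping output names to lists of (count, word).
    """
    tree0, tree1, outputs = Counter(), Counter(), {}
    for arg in args:
        if arg.startswith("##"):
            # Frequency-ranked output; by default this destroys Tree 1.
            ranked = sorted(tree0.items(), key=lambda kv: (-kv[1], kv[0]))
            outputs[arg[2:]] = [(n, w) for w, n in ranked]
            tree1.clear()
        elif arg.startswith("#"):
            # Alphabetical output; the trees are left untouched.
            outputs[arg[1:]] = [(tree0[w], w) for w in sorted(tree0)]
        elif arg.startswith("!"):
            # Input for Tree 1 (the dynamic stopword list).
            tree1.update(files[arg[1:]].upper().split())
        else:
            # Input for Tree 0, filtered by the current Tree 1.
            for w in files[arg].upper().split():
                if w not in tree1:
                    tree0[w] += 1
    return outputs

files = {
    "EXCLUDE.1": "alpha",
    "EXCLUDE.2": "beta",
    "ANALYSE.1": "alpha beta gamma",
    "ANALYSE.2": "alpha beta delta",
}

a = run(["!EXCLUDE.1", "!EXCLUDE.2", "ANALYSE.1", "ANALYSE.2", "#ALP"], files)
b = run(["!EXCLUDE.1", "ANALYSE.1", "!EXCLUDE.2", "ANALYSE.2", "#ALP"], files)
```

In run a the word BETA is excluded from both subject files; in
run b it is admitted from ANALYSE.1, because EXCLUDE.2 has not yet
been read at that point.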
If there are more than a few input files, it is not convenient to
list them all on the command line. In such a case, the
following syntax is appropriate:
dsa !@FILE1 @FILE0 #ALP ##FRE
where the file named FILE1 contains the two lines
EXCLUDE.1
EXCLUDE.2
and the file named FILE0 contains the two lines
ANALYSE.1
ANALYSE.2
This command, which indirectly references the names of the
input files, is entirely equivalent to the direct command
dsa !EXCLUDE.1 !EXCLUDE.2 ANALYSE.1 ANALYSE.2 #ALP ##FRE
Files such as FILE0 and FILE1 are called indirect command files.
This technique may also be applied to output files (although it
is usually unnecessary in practice):
dsa !@FILE1 @FILE0 @FILEOUT
where the file named FILEOUT contains the two lines
#ALP
##FRE
Indeed, it is possible to put all the filenames into a single
file:
dsa @FILE3
where the file named FILE3 has six lines:
!EXCLUDE.1
!EXCLUDE.2
ANALYSE.1
ANALYSE.2
#ALP
##FRE
It is even possible for one (or more) of the files in such a
list to be itself a list of filenames:
dsa @FILE4
where the file named FILE4 has four lines:
@!FILE5
@FILE6
#ALP
##FRE
Here the file named FILE5 has two lines:
EXCLUDE.1
EXCLUDE.2
and the file named FILE6 also has two lines:
ANALYSE.1
ANALYSE.2
Clearly, in general many permutations are possible. The user
is free to arrange the indirect referencing of filenames to
achieve maximum convenience. Seven levels of indirection are
supported. When constructing complex indirect command file
hierarchies, it is necessary to note that the exclamation mark
in fact acts as a toggle between Tree 0 and Tree 1. For
example, consider the command
dsa @!FILE7 ##FRE
where the file FILE7 contains the two lines:
EXCLUDE.1
!ANALYSE.1
This is equivalent to the direct command:
dsa !EXCLUDE.1 ANALYSE.1 ##FRE
Whenever a file prefix comprises more than one character, these
characters may be in any order. Thus the two following commands
are equivalent:
dsa @!FILE7 ##FRE
dsa !@FILE7 ##FRE
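The expansion of indirect command files, including the toggle
behaviour of the exclamation mark, can be sketched in Python as
follows. The function expand and the files dictionary are
hypothetical names chosen for illustration; only the prefixes '!',
'@', '#' and '##' are handled, and the way DSA itself enforces the
limit of seven levels is an assumption.

```python
MAX_DEPTH = 7  # the guide states that seven levels of indirection are supported

def expand(args, files, tree1=False, depth=0):
    """Flatten indirect command files into a direct argument list.

    `files` maps hypothetical indirect-file names to lists of lines.
    `tree1` is the tree selected by the enclosing context; each '!'
    in a prefix toggles it. Prefix characters may be in any order.
    """
    out = []
    for arg in args:
        if arg.startswith("#"):
            out.append(arg)          # output file: pass through unchanged
            continue
        prefix = arg[:len(arg) - len(arg.lstrip("!@"))]
        name = arg[len(prefix):]
        # An odd number of '!' characters toggles the inherited tree.
        target = tree1 ^ (prefix.count("!") % 2 == 1)
        if "@" in prefix:
            if depth >= MAX_DEPTH:
                raise ValueError("too many levels of indirection")
            out += expand(files[name], files, target, depth + 1)
        else:
            out.append(("!" if target else "") + name)
    return out

files = {
    "FILE4": ["@!FILE5", "@FILE6", "#ALP", "##FRE"],
    "FILE5": ["EXCLUDE.1", "EXCLUDE.2"],
    "FILE6": ["ANALYSE.1", "ANALYSE.2"],
    "FILE7": ["EXCLUDE.1", "!ANALYSE.1"],
}

direct = expand(["@!FILE7", "##FRE"], files)
# direct == ["!EXCLUDE.1", "ANALYSE.1", "##FRE"]
```

Passing the tree selection down as a parameter, rather than
rewriting each line, is what makes '!' act as a toggle: a '!'
inside a file that was itself opened with '!' switches back to
Tree 0.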
5. TRUNCATION
Characters read by DSA from an input file are always filtered
in order to replace any non-ASCII characters by the nearest
equivalent ASCII character (0-7F hex), on the assumption that
Codepage 850 applies. This procedure is uncontroversial, except
possibly for the case of the character 'ß', which is replaced by
'Z'; for details, see companion documentation "DSA Release 1.0
Annotated Source Listing". By default, words in the input file
are then subjected to an inbuilt truncation algorithm. Control
of this step is provided by command line switches. For example,
dsa ANALYSE.1 ##FRE
will result in the application of the default truncation
algorithm for English text. The command
dsa ANALYSE.1 ##FRE /T0
will suppress truncation completely, whereas the command
dsa ANALYSE.1 ##FRE /FR
will lead to the application of the default truncation
algorithm for French text; at the same time, this switch
forces use of the stopwords in the file STOP.FR instead of the
usual STOP.EN. To avoid using any stopwords, another switch
is available:
dsa ANALYSE.1 ##FRE /C
Switches can be combined: to invoke the default truncation
algorithms for German text but at the same time avoid using any
stopwords (which would otherwise be taken from file STOP.DE,
created if not existing), the appropriate command is:
dsa ANALYSE.1 ##FRE /C /DE
Unlike filenames, switches can appear anywhere in the command
line, and in any order. Switches cannot, however, be placed in
indirect command files. A full list of the switches available
in DSA Release 1.0 is given in Section 12 below. Those relating
to truncation, and particularly the German and French options, are
still under development and are expected to change in future
releases: please consult the documentation updates (if any)
on the distribution diskette.
6. PHRASE STATISTICS
Suppose the file named FOX.TXT contains the single sentence:
"The quick brown fox jumps over the lazy dog."
Assuming the only stopword to be "the", the command
dsa FOX.TXT #ALP
gives the following alphabetical output:
1 BROWN
1 DOG
1 FOX
1 JUMP+
1 LAZY
1 OVER
1 QUICK
where the '+' symbol attached to "JUMP" indicates that
truncation has been performed.
Suppose now that we are interested in the statistics of
three-word phrases rather than individual words. The command:
dsa FOX.TXT #ALP /3
gives the output:
1 BROWN_FOX_JUMP+
1 FOX_JUMP+_OVER
1 OVER_THE_LAZY
1 QUICK_BROWN_FOX
When DSA is used for extracting phrase statistics in this way,
there is exclusion of phrases which either begin or end with
any stopword in Tree 2; however, embedded stopwords have no
effect. Thus, in the above example, "OVER_THE_LAZY" does
appear, whereas "THE_QUICK_BROWN" has been excluded.
A maximum of five words can be combined into indexed phrases.
DSA is sufficiently intelligent to avoid the building of phrases
across sentence boundaries. For example, consider an input file
containing the text:
"The quick brown fox jumps over the lazy dog. One
cannot teach old dogs new tricks."
The same command as above then gives:
1 BROWN_FOX_JUMP+
1 CANNOT_TEACH_OLD
1 DOGS_NEW_TRICK+
1 FOX_JUMP+_OVER
1 OLD_DOGS_NEW
1 ONE_CANNOT_TEACH
1 OVER_THE_LAZY
1 QUICK_BROWN_FOX
1 TEACH_OLD_DOGS
and it should be noticed that the phrases "LAZY_DOG_ONE" and
"DOG_ONE_CANNOT" do not appear, because they span a full stop.
Lesser punctuation, for example a comma, semicolon or colon, has
the same effect as a full stop; a blank line in the text is
also treated as equivalent to punctuation, and will prevent
phrase construction.
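The phrase-building rules of this section can be sketched in
Python as follows. This is an illustration only: the function name
phrases is hypothetical, truncation is omitted (so "JUMPS" appears
where DSA would print "JUMP+"), and only the punctuation marks
named above are treated as phrase boundaries.

```python
import re

def phrases(text, n, stopwords):
    """Return the sorted n-word phrases found in `text`."""
    result = []
    # A full stop, comma, semicolon, colon or blank line ends a segment;
    # phrases are never built across segment boundaries.
    for segment in re.split(r"[.,;:]|\n\s*\n", text.upper()):
        # Strings beginning with a digit are ignored, as by default in DSA.
        words = re.findall(r"[A-Z][A-Z0-9]*", segment)
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Drop phrases that begin or end with a stopword;
            # embedded stopwords are allowed.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            result.append("_".join(gram))
    return sorted(result)

text = ("The quick brown fox jumps over the lazy dog. One\n"
        "cannot teach old dogs new tricks.")
out = phrases(text, 3, {"THE"})
```

Apart from the missing truncation marks, out reproduces the nine
phrases listed above; "LAZY_DOG_ONE" is absent because the full
stop separates the two segments.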
7. SOME OTHER COMMAND LINE SWITCHES
By default, the frequency-ranked output does not include words
which occur only once in the text. Inclusion of the singleton
words can be forced by means of a switch:
dsa ANALYSE.1 ##FRE /A
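The effect of the /A switch on a frequency-ranked listing can be
sketched in Python. The function frequency_ranked is a
hypothetical name, and the alphabetical ordering of words with
equal frequency is an assumption, since this guide does not state
how DSA orders ties.

```python
from collections import Counter

def frequency_ranked(tree0, include_singletons=False):
    """Return (count, word) pairs, most frequent first."""
    rows = [(n, w) for w, n in tree0.items()
            if include_singletons or n > 1]
    # Assumption: ties are broken alphabetically.
    return sorted(rows, key=lambda r: (-r[0], r[1]))

tree0 = Counter("A B B C C C D".split())
default_view = frequency_ranked(tree0)           # singletons A and D omitted
with_a_switch = frequency_ranked(tree0, True)    # corresponds to /A
```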
Strings beginning with a digit, such as "941225", are normally
ignored. A switch is available to force their recognition:
dsa ANALYSE.1 ##FRE /B
An important switch permits multistage statistical analysis
by saving the data from Tree 1 prior to a frequency-ranked
output step and then restoring it to Tree 1 afterwards. For
example, the command
dsa !EXCLUDE.1 ANALYSE.1 ##FRE ANALYSE.2 #ALP /R
reads from file EXCLUDE.1 into Tree 1, uses the resulting set
of stopwords while reading from file ANALYSE.1 into Tree 0,
writes frequency-ranked output to file FRE, then reads from file
ANALYSE.2, adding to the data already in Tree 0 while again
excluding words previously read from file EXCLUDE.1, and finally
writes an alphabetical word list to file ALP. The switch "/R"
ensures that the stopwords in Tree 1 are preserved following
the ##FRE step.
In the examples given hitherto, the presence of a stopword in
Tree 1 denies this word access to Tree 0. Sometimes it is
useful to require that access to Tree 0 is denied only if the
frequency associated with the word in Tree 1 is above some
specified level. A series of six switches provides this
possibility. For example, the command
dsa !EXCLUDE.1 ANALYSE.1 ##FRE /P3
has the effect of excluding from Tree 0 those words which have a
relative frequency in Tree 1 higher than 0.01%.
The switch /Q reverses the logic controlling the exclusion of
words (or phrases) from Tree 0: words that would have been
excluded from Tree 0 in the absence of the /Q switch are now
admitted, and vice versa.
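The exclusion test controlled by the /P and /Q switches can be
sketched in Python as follows. The thresholds are those listed in
Section 12; the function name excluded is hypothetical, and the
definition of relative frequency as a word's count divided by the
total number of word occurrences in Tree 1 is an assumption, since
this guide does not define the term.

```python
from collections import Counter

# Thresholds for /P1 ... /P6 as listed in Section 12; /P0 excludes
# every word present in Tree 1.
THRESHOLDS = {1: 1e-6, 2: 1e-5, 3: 1e-4, 4: 1e-3, 5: 1e-2, 6: 1e-1}

def excluded(word, tree1, p=0, q=False):
    """Return True if `word` is denied access to Tree 0."""
    total = sum(tree1.values())
    count = tree1[word]
    if p == 0:
        hit = count > 0
    else:
        # Assumed definition: relative frequency = count / total.
        hit = total > 0 and count / total > THRESHOLDS[p]
    return hit != q   # /Q reverses the decision

tree1 = Counter({"COMMON": 999, "RARE": 1})
```

With this sample Tree 1, excluded("RARE", tree1) is True under the
default /P0, whereas excluded("RARE", tree1, p=5) is False because
the relative frequency of RARE lies below the /P5 threshold.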
8. MEMORY-MANAGEMENT SWITCHES
By default, DSA uses the maximum amount of available memory.
Although the program has procedures implementing virtual
memory management, run-time performance inevitably deteriorates
rapidly when disk accesses become frequent. Normally the user
can confidently rely on the algorithms built into DSA to utilise
the real RAM on any particular hardware in an efficient way.
Occasionally, however, there may be some advantage in being able
to control the memory management manually, and various switches
are provided for this purpose.
As a first example, suppose Tree 1 is not required, because
there is a single input file and no frequency-ranked output is
wanted. By default, DSA always divides the available memory
equally between Tree 0 and Tree 1, regardless of the fact that
only a single filename occurs on the command line. To avoid
wasting half the RAM, a switch can be used:
dsa EXCLUDE.1 #ALP /N0
which gives all the available memory to Tree 0. On the other
hand, it may be advantageous to give Tree 1 more than its usual
share of memory:
dsa !VERY.BIG TINY.ONE ##FRE /N3
Another important case is that in which DSA must multitask with
other software on a machine with limited RAM. A switch
restricts to a minimum the memory demanded by DSA:
dsa ANALYSE.1 #ALP /M3
As usual, switches can be combined, permitting fine-tuning of
the memory requirements:
dsa ANALYSE.1 #ALP /M3 /N0
The various memory-management switches are defined in more
detail in Section 12 below.
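The combined effect of the /M and /N switches can be expressed as
simple arithmetic, sketched here in Python. The function
memory_split is a hypothetical name; the sketch assumes that the
whole budget is shared between Tree 0 and Tree 1 only, ignoring
Tree 2 and any program overhead.

```python
from fractions import Fraction

def memory_split(available, m=1, n=1):
    """Return (tree0, tree1) memory shares for switches /Mm and /Nn."""
    budget = Fraction(available, m)        # /M1 all, /M2 half, /M3 one third
    tree1 = budget * Fraction(n, n + 1)    # Tree 1 gets n times Tree 0's share
    return budget - tree1, tree1

default = memory_split(1200)            # no switches: equal shares
all_tree0 = memory_split(1200, n=0)     # /N0
big_tree1 = memory_split(1200, n=3)     # /N3
frugal = memory_split(1200, m=3, n=0)   # /M3 /N0
```

With 1200 units available, the default split is 600 and 600, /N3
gives 300 to Tree 0 and 900 to Tree 1, and /M3 /N0 gives 400 units
to Tree 0 and none to Tree 1.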
9. IMPORT OF DATA
As explained above, the command
dsa FOX.TXT #ALP ##FRE
generates two output files in a standard format. It is
possible to use these files as the input for a subsequent run
of DSA, a procedure referred to as the import of data.
Thus the command
dsa _ALP DUCK.TXT #NEW
causes the file ALP to be read into Tree 0. File DUCK.TXT is
then read, and its contents added to Tree 0. Finally, an
alphabetical list combining words from both ALP and DUCK.TXT
is written to the output file NEW. Another example is
provided by the command
dsa !_ALP DUCK.TXT #NEW
which causes the words in the file ALP to be read into Tree 1,
where they would act as stopwords for the analysis of the file
DUCK.TXT.
In the examples above, the imported words are added to the
existing contents of the tree. If it is wished to clear the tree
before importing data from a file, the following form of
command may be used:
dsa CAT.TXT ##FRE __ALP #NEW
In this case, the data read from CAT.TXT into Tree 0 is discarded
before the import from file ALP occurs.
It is not necessary for the import files (ALP in the above
examples) to have been generated by DSA: the user can use any
other convenient program to create them. The required format of
each record in the file is:
decimal_number_string character_string CRLF
where CRLF denotes the pair of ASCII characters signifying the
end of a line. The decimal number must be greater than zero
and less than one thousand million. The character string must
have the form used by the DSA program in the trees: it must
comprise contiguous alphanumeric characters (no whitespace),
plus optionally up to four embedded '_' characters, each
optionally preceded by a '+' character; a '+' character is
also optional at the very end of the string. An input file
illustrating both the simplest and a more complex case might be:
1 NMR
999999999 NUCL+_MAGNET+_RESONAN+_IMAG+_COIL+
The whitespace which precedes the decimal number and separates
it from the alphanumeric string is of arbitrary length: the
data is read in a way which is insensitive to column position.
The records in import files may not exceed 80 characters in
length.
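Users who generate import files with other programs may find it
useful to check each record against the format described above.
The following Python sketch shows one way of doing so. The names
RECORD and parse_record are hypothetical, and the sketch assumes
that the 80-character limit is counted without the terminating
CRLF, which this guide does not specify.

```python
import re

RECORD = re.compile(
    r"\s*(\d+)\s+"                                  # frequency
    r"([A-Za-z0-9]+\+?(?:_[A-Za-z0-9]+\+?){0,4})"   # word or phrase
    r"\s*"
)

def parse_record(line):
    """Return (frequency, string) for a valid import record, else None."""
    line = line.rstrip("\r\n")
    if len(line) > 80:            # assumed to exclude the CRLF
        return None
    m = RECORD.fullmatch(line)
    if m is None:
        return None
    freq = int(m.group(1))
    # Greater than zero and less than one thousand million.
    if not 0 < freq < 1_000_000_000:
        return None
    return freq, m.group(2)

simple = parse_record("        1 NMR\r\n")
complex_case = parse_record(
    "999999999 NUCL+_MAGNET+_RESONAN+_IMAG+_COIL+\r\n")
```

Both sample records from the text above are accepted; a record
with a frequency of zero, or with more than four embedded '_'
characters, yields None.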
10. SAVE AND UNSAVE: UNFORMATTED TEMPORARY FILES
It is sometimes necessary to save the contents of Tree 0 or
Tree 1 temporarily in an unformatted disk file, and to restore
the tree from the file at a later stage in the same run of the
DSA program. This is done automatically when the /R switch is
specified in order to restore the contents of Tree 1 after the
tree has been used for generating a frequency-ranked output file
(see Section 7 above), and users will usually not need to
concern themselves with manipulating individual unformatted
files. If direct control of data transfer between the trees and
the disk is considered essential, the required syntax is
exemplified by the command
dsa !EXCLUDE.1 ANALYSE.1 }!TEMP ##FRE {!TEMP
The contents of Tree 1 are written to the unformatted files
TEMP.TOK and TEMP.NDX prior to the generation of the frequency-
ranked formatted output in file FRE; the contents of Tree 1 are
subsequently unsaved from the same files. Notice that the
chosen filename (in this example TEMP) must have no extension,
since the program DSA adds the required .TOK and .NDX in order
to generate two separate filenames. The above command is
entirely equivalent to the simpler
dsa !EXCLUDE.1 ANALYSE.1 ##FRE /R
(In a more realistic example, the restored contents of Tree 1
would be used as stopwords for a further statistical analysis
step.)
It is also possible to save the contents of Tree 0: consider
the (rather pointless) command formulation
dsa ANALYSE.1 }TEMP ANALYSE.2 #ALP {TEMP ##FRE
The result is that ALP contains an alphabetically ordered list
of the words from both ANALYSE.1 and ANALYSE.2, whereas FRE
contains a frequency-ranked list of words taken from ANALYSE.1
alone.
Unformatted files can be used in this way only during a single
run of the DSA program. An error exit will occur if any attempt
is made to unsave an unformatted file which was created during
a previous run.
11. OTHER OUTPUT FILES
In addition to the various input and output files already
discussed above, the DSA program generates various other files.
There are six temporary files, three named *.TOK and three
named *.NDX, which are deleted on exit unless there has been a
serious run-time error.
More interesting is the file DSA.LOG, which contains information
on the way in which the run progressed, in particular a list of
the files read and written. If the /V or /VV switch has been
specified on the command line, the logfile will also give
technical details relating to the mechanics of the run, for
example the amount of memory allocated. This can be useful when
developing a command strategy to optimise run-time speed or to
allow unobtrusive background running.
Finally, the program DSA will sometimes generate a file named
DSA.OUT. This occurs only if the user has inadvertently failed
to specify any output filename on the command line. For
example, the (accidentally incomplete) command
dsa ANALYSE.1
will cause an alphabetical list of words to be written to the
default output file DSA.OUT.
---------------------------------------------------------------
12. SUMMARY OF SYNTAX
Filename prefixes:
name.ext Read from file name.ext into Tree 0
!name.ext Read from file name.ext into Tree 1
@name.ext Read list of filenames from name.ext
~name.ext Read list of filenames from name.ext (for GNU)
#name.ext Write an alphabetical wordlist (for DOS/OS2)
##name.ext Write a frequency-ranked wordlist (for DOS/OS2)
%name.ext Write an alphabetical wordlist (for UNIX)
%%name.ext Write a frequency-ranked wordlist (for UNIX)
_name.ext Import from formatted file name.ext
__name.ext Import from formatted file name.ext, after clearing
}name Save to unformatted files name.tok and name.ndx
{name Unsave from unformatted files name.tok and name.ndx
Command line switches:
/A export also those strings occurring only once in text
/B recognise strings beginning with a digit
/C ignore all stop words
/DE use German stopwords and stemming [subject to /C and /T0]
/FR use French stopwords and stemming [subject to /C and /T0]
/M1 use maximum available dynamic memory [default]
/M2 use half of available dynamic memory
/M3 use one third of available dynamic memory
/N0 give Tree 1 no memory
/N1 give Tree 1 the same amount of memory as Tree 0 [default]
/N2 give Tree 1 twice as much memory as Tree 0
/N3 give Tree 1 three times as much memory as Tree 0
/P0 exclude all words occurring in Tree 1 [default]
/P1 exclude words at Tree-1 relative frequency > 0.000001
/P2 exclude words at Tree-1 relative frequency > 0.00001
/P3 exclude words at Tree-1 relative frequency > 0.0001
/P4 exclude words at Tree-1 relative frequency > 0.001
/P5 exclude words at Tree-1 relative frequency > 0.01
/P6 exclude words at Tree-1 relative frequency > 0.1
/Q reverse logic for excluding words
/R restore contents of Tree 1 (after frequency-ranking step)
/T0 stemming: none
/T1 stemming: inbuilt algorithm [default]
/T2 stemming: M.F. Porter's algorithm (English only)
/V verbose log file
/VV very verbose log file
/1 statistics of individual words [default]
/2 statistics of 2-word phrases
/3 statistics of 3-word phrases
/4 statistics of 4-word phrases
/5 statistics of 5-word phrases
Reserved filenames: dsa.log, dsa.out, stop.en, stop.de, stop.fr