        ---------------------------------------------------------------
                               DSA  Release  1.0
                                  User  Guide
                                    - by -
                                  R.M. Thomas
                                                              May 1995
        ---------------------------------------------------------------
                                   CONTENTS

                      1.  INTRODUCTION
                      2.  HOW TO INSTALL DSA FOR DOS/OS2
                      3.  PROGRAM DESIGN
                      4.  BASIC COMMAND LINE USAGE
                      5.  TRUNCATION
                      6.  PHRASE STATISTICS
                      7.  SOME OTHER COMMAND LINE SWITCHES
                      8.  MEMORY-MANAGEMENT SWITCHES
                      9.  IMPORT OF DATA
                     10.  SAVE AND UNSAVE:  UNFORMATTED TEMPORARY FILES
                     11.  OTHER OUTPUT FILES
                     12.  SUMMARY OF SYNTAX
        ---------------------------------------------------------------

        1.  INTRODUCTION


        DSA is a program which performs statistical analyses of the
        frequency of words and phrases in machine-readable text.  The
        user specifies the required type of analysis and the form of
        the output by means of command-line parameters and/or indirect
        command files.  By default, truncation of words is performed
        using inbuilt stemming algorithms (presently two alternatives
        for English, plus one each for German and French).  Each
        grammatical sentence can optionally be treated as (overlapping)
        sets of up to five adjacent words;  the statistical analysis is
        then performed on the resulting phrases rather than the
        individual words.

        A special feature of DSA is that it allows efficient analysis
        of the statistical differences in word frequency between two
        files.  The user can specify that words in the first file are
        to be read and stored as a dynamic stopword list;  when the
        second file is read, statistical analysis is done only on those
        words not appearing (or rarely appearing) in the first file.
        There is no limit on the size of the first file which defines
        the stopword set;  moreover, by means of an indirect command
        file it is possible to specify a set of input files, rather
        than a single file, as the source of stopwords.  This
        capability, combined with the corresponding possibility of
        replacing the second single file by a list of filenames, permits
        flexible and sophisticated differential statistical analyses of
        plural texts.  To facilitate optimisation of run-time
        performance on various hardware platforms, there are
        command-line options permitting the user considerable control
        over both memory allocation and the deployment of temporary
        files for intermediate analysis steps.

        DSA is intended as an instrument for researching the
        effectiveness of approaches to information retrieval based on
        word-frequency statistics.  Published literature suggests that
        statistical methods are sometimes successful, but quantitative
        confirmation is often lacking.

        The program is currently supplied as executables for OS2, for
        DOS with conventional memory only, and for DOS making use
        of the GNU 32-bit DOS extender.  An SCO-UNIX executable is
        available on request.  All source code is included in the
        software distribution.



        2.  HOW TO INSTALL DSA FOR DOS/OS2


        Copy all files, including all subdirectories, from the
        installation diskette to a suitably named directory on the
        hard disk.  For example, under OS2:

           D:
           mkdir dsa
           cd dsa
           xcopy a:\*.* . /s /e /v

        The directory D:\DSA may then be added to the search path by
        modifying AUTOEXEC.BAT or STARTUP.CMD as appropriate.




        3.  PROGRAM DESIGN
        
        
        The basic design of DSA is illustrated in Figure 1.  There are
        three data structures called indexes, which may be regarded as
        alphabetically ordered lists of words;  each of these words is
        accompanied by a number which represents its frequency, that
        is, the number of times that this word has occurred in the input
        text since the index was initialised.  For reasons which are
        unimportant here, the three indexes are referred to as Tree 0,
        Tree 1 and Tree 2.  Initially it is helpful to think of Tree
        0 and Tree 1 as being of approximately the same size, whereas
        Tree 2 is much smaller.  Tree 2 has only one purpose:  to
        hold a small list of conventional stopwords such as "the",
        "but" and so on.  Tree 1 has two primary functions:  first,
        to hold a (possibly very large) list of stopwords taken from
        a text file or set of files;  and, secondly, to serve as the
        temporary working storage space which is required for the step
        of generating output in frequency-ranked format:  that is, with
        the most commonly occurring words at the top of the list.
        Tree 0 holds the words taken from the file, or set of files,
        for which an analysis is currently in progress.  New words are
        admitted to Tree 0 only after checking that they do not occur
        (or occur only rarely) in either Tree 1 or Tree 2.
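
        The relationship between the three trees can be pictured with
        a minimal Python sketch.  It is illustrative only:  the names
        tree0, tree1, tree2 and both functions are hypothetical, plain
        counters stand in for whatever structures DSA really uses, and
        the "occur only rarely" refinement is not modelled.

```python
from collections import Counter

# Hypothetical stand-ins for the three indexes; the guide describes them
# only as alphabetically ordered word lists with frequencies.
tree0 = Counter()   # words under analysis
tree1 = Counter()   # dynamic stopwords read from a file or set of files
tree2 = Counter()   # small list of conventional stopwords


def read_into_tree1(words):
    """Count words into Tree 1 unless they already appear in Tree 2."""
    for w in words:
        if w not in tree2:
            tree1[w] += 1


def read_into_tree0(words):
    """Count words into Tree 0 unless they occur in Tree 1 or Tree 2."""
    for w in words:
        if w not in tree1 and w not in tree2:
            tree0[w] += 1


tree2.update(["THE", "BUT"])
read_into_tree1("THE CAT SAT".split())
read_into_tree0("THE CAT AND THE DOG SAT BUT THE DOG RAN".split())

# Alphabetical listing of Tree 0, each word with its frequency.
for word in sorted(tree0):
    print(tree0[word], word)
```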



        4.  BASIC COMMAND LINE USAGE
 
       
        Figure 1 shows a simple way of using DSA.  First, a short list
        of stopwords is imported from file STOP.EN into Tree 2.  A file
        EXCLUDE.1 is then read, its individual words being entered into
        Tree 1 (provided they do not already appear in Tree 2).  Finally
        the subject file ANALYSE.1 is read into Tree 0, words found in
        Tree 1 or Tree 2 being excluded;  when the end of the file is
        reached, the contents of Tree 0 are written out in alphabetical
        order to file ALP and in frequency-ranked format to file FRE.

        The basic command line for achieving this sequence of
        operations is as follows:

                   dsa !EXCLUDE.1 ANALYSE.1 #ALP ##FRE

        The exclamation mark prefixing the first filename specifies that
        the text is to be read into Tree 1;  in the absence of such a
        mark, the contents of a file are read into Tree 0.  It is not
        possible to associate an input filename with Tree 2:  its
        stopwords are always read from a file named STOP.EN, STOP.DE or
        STOP.FR;  by default, DSA creates a suitable stopword file with
        one of these names if the file does not already exist.

        The hash mark prefixing the output filename ALP indicates that a
        simple alphabetical listing of words is required.  This
        operation does not disturb the data in any of the three trees.
        The double hash mark prefixing the output filename FRE calls
        for a frequency-ranked output of the contents of Tree 0, and it
        is important for the user to understand that, by default, this
        destroys any existing data in Tree 1.  In the present example,
        this loss of Tree 1 is of no importance, since there is no
        subsequent operation referencing it;  in complex sequences of
        statistical analyses, however, it may be necessary to save the
        data in Tree 1 before specifying ##FRE, as will be discussed
        below.

        Several input files for both Tree 0 and Tree 1 may appear on
        the command line:

           dsa !EXCLUDE.1 !EXCLUDE.2 ANALYSE.1 ANALYSE.2 #ALP ##FRE

        In such a case the files EXCLUDE.1 and EXCLUDE.2 are read
        sequentially into Tree 1, that is, as if the two files had
        been concatenated before being read.  Similarly, the files
        ANALYSE.1 and ANALYSE.2 are read into Tree 0, excluding words
        appearing in either EXCLUDE.1 or EXCLUDE.2.  Notice that the
        order in which filenames appear on the command line is highly
        significant.  Thus the command

           dsa !EXCLUDE.1 ANALYSE.1 !EXCLUDE.2 ANALYSE.2 #ALP ##FRE

        is different from the previous command:  this time, Tree 0 will
        eventually hold words from ANALYSE.1 which do not occur in file
        EXCLUDE.1, together with words from ANALYSE.2 which occur in
        neither EXCLUDE.1 nor EXCLUDE.2.  Similarly, the following
        command

           dsa !EXCLUDE.1 ANALYSE.1 #ALP !EXCLUDE.2 ANALYSE.2 ##FRE

        is valid, but has a different meaning:  the file ALP will now
        contain an alphabetical list of words from ANALYSE.1 not
        occurring in EXCLUDE.1, while the contents of the output file
        FRE will be the same as for the previous command above.

        If there are more than a few input files, it is not convenient to
        list them all on the command line.  In such a case, the
        following syntax is appropriate:

           dsa !@FILE1 @FILE0 #ALP ##FRE

        where the file named FILE1 contains the two lines

        EXCLUDE.1
        EXCLUDE.2

        and the file named FILE0 contains the two lines

        ANALYSE.1
        ANALYSE.2

        This command, which indirectly references the names of the
        input files, is entirely equivalent to the direct command

           dsa !EXCLUDE.1 !EXCLUDE.2 ANALYSE.1 ANALYSE.2 #ALP ##FRE

        Files such as FILE0 and FILE1 are called indirect command files.
        This technique may also be applied to output files (although it
        is usually unnecessary in practice):

          dsa !@FILE1 @FILE0 @FILEOUT

        where the file named FILEOUT contains the two lines

        #ALP
        ##FRE

        Indeed, it is possible to put all the filenames into a single
        file:

           dsa @FILE3

        where the file named FILE3 has six lines:

        !EXCLUDE.1
        !EXCLUDE.2
        ANALYSE.1
        ANALYSE.2
        #ALP
        ##FRE

        It is even possible for one (or more) of the files in such a
        list to be itself a list of filenames:

           dsa @FILE4

        where the file named FILE4 has four lines:

        @!FILE5
        @FILE6
        #ALP
        ##FRE

        Here the file named FILE5 has two lines:

        EXCLUDE.1
        EXCLUDE.2

        and the file named FILE6 also has two lines:

        ANALYSE.1
        ANALYSE.2

        Clearly, in general many permutations are possible.  The user
        is free to arrange the indirect referencing of filenames to
        achieve maximum convenience.  Seven levels of indirection are
        supported.  When constructing complex indirect command file
        hierarchies, it is necessary to note that the exclamation mark
        in fact acts as a toggle between Tree 0 and Tree 1.  For
        example, consider the command

           dsa @!FILE7 ##FRE

        where the file FILE7 contains the two lines:

        EXCLUDE.1
        !ANALYSE.1
       
        This is equivalent to the direct command:

            dsa !EXCLUDE.1 ANALYSE.1 ##FRE

        Whenever a file prefix comprises more than one character, these
        characters may be in any order.  Thus the two following commands
        are equivalent:

           dsa @!FILE7 ##FRE
           dsa !@FILE7 ##FRE
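
        The expansion of indirect command files, including the toggle
        action of the exclamation mark, can be sketched in Python as
        follows.  This is an illustration under stated assumptions, not
        the DSA source:  the function expand and the dictionary files
        (an in-memory stand-in for the indirect command files on disk)
        are hypothetical, only the prefixes '!', '@', '#' and '##' are
        handled, and the exact way DSA counts the seven supported
        levels is assumed.

```python
MAX_DEPTH = 7  # the guide states that seven levels of indirection are supported


def expand(args, files, base_tree=0, depth=0):
    """Flatten an argument list into (kind, tree, name) tuples.

    `files` maps the name of an indirect command file to its lines.
    """
    result = []
    for arg in args:
        # Output files (# and ##) pass through unchanged.
        if arg.startswith("#"):
            result.append(("output", None, arg))
            continue
        # Prefix characters may appear in any order before the name.
        name = arg.lstrip("!@")
        prefix = arg[: len(arg) - len(name)]
        # '!' toggles between Tree 0 and Tree 1, relative to the tree
        # selected for the enclosing indirect command file.
        tree = base_tree ^ 1 if "!" in prefix else base_tree
        if "@" in prefix:
            if depth >= MAX_DEPTH:
                raise ValueError("too many levels of indirection")
            result.extend(expand(files[name], files, tree, depth + 1))
        else:
            result.append(("input", tree, name))
    return result


files = {"FILE7": ["EXCLUDE.1", "!ANALYSE.1"]}
for item in expand(["@!FILE7", "##FRE"], files):
    print(item)
```

        In this sketch the argument list "@!FILE7 ##FRE" flattens to
        the same sequence as "!EXCLUDE.1 ANALYSE.1 ##FRE", and the
        prefixes "@!" and "!@" are treated identically.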



        5.  TRUNCATION


        Characters read by DSA from an input file are always filtered
        in order to replace any non-ASCII characters by the nearest
        equivalent ASCII character (0-7F hex), on the assumption that
        Codepage 850 applies.  This procedure is uncontroversial, except
        possibly for the case of the character 'ß', which is replaced by
        'Z';  for details, see companion documentation "DSA Release 1.0
        Annotated Source Listing".  By default, words in the input file
        are then subjected to an inbuilt truncation algorithm.  Control
        of this step is provided by command line switches.  For example,

           dsa ANALYSE.1 ##FRE

        will result in the application of the default truncation
        algorithm for English text.  The command

           dsa ANALYSE.1 ##FRE /T0

        will suppress truncation completely, whereas the command

           dsa ANALYSE.1 ##FRE /FR

        will lead to the application of the default truncation
        algorithm for French text;  at the same time, this switch
        forces use of the stopwords in the file STOP.FR instead of the
        usual STOP.EN.   To avoid using any stopwords, another switch
        is available:

           dsa ANALYSE.1 ##FRE /C

        Switches can be combined:  to invoke the default truncation
        algorithms for German text but at the same time avoid using any
        stopwords (which would otherwise be taken from file STOP.DE,
        created if not existing), the appropriate command is:

           dsa ANALYSE.1 ##FRE /C /DE

        Unlike filenames, switches can appear anywhere in the command
        line, and in any order.  Switches cannot, however, be placed in
        indirect command files.  A full list of the switches available
        in DSA Release 1.0 is given in Section 12 below.  Those relating
        to truncation and particularly the German and French options are
        still under development and are expected to change in future
        releases:  please consult the documentation updates (if any)
        on the distribution diskette.



        6.  PHRASE STATISTICS


        Suppose the file named FOX.TXT contains the single sentence:

             "The quick brown fox jumps over the lazy dog."

        Assuming the only stopword to be "the", the command

           dsa FOX.TXT #ALP

        gives the following alphabetical output:

                   1    BROWN
                   1    DOG
                   1    FOX
                   1    JUMP+
                   1    LAZY
                   1    OVER
                   1    QUICK

        where the '+' symbol attached to "JUMP" indicates that
        truncation has been performed.

        Suppose now that we are interested in the statistics of
        three-word phrases rather than individual words.  The command:

            dsa FOX.TXT #ALP /3

        gives the output:

                   1    BROWN_FOX_JUMP+
                   1    FOX_JUMP+_OVER
                   1    OVER_THE_LAZY
                   1    QUICK_BROWN_FOX

        When DSA is used for extracting phrase statistics in this way,
        there is exclusion of phrases which either begin or end with
        any stopword in Tree 2;  however, embedded stopwords have no
        effect.  Thus, in the above example, "OVER_THE_LAZY" does
        appear, whereas "THE_QUICK_BROWN" has been excluded.

        A maximum of five words can be combined into indexed phrases.

        DSA is sufficiently intelligent to avoid the building of phrases
        across sentence boundaries.  For example, consider an input file
        containing the text:

            "The quick brown fox jumps over the lazy dog.  One
             cannot teach old dogs new tricks."

        The same command as above then gives:

                   1    BROWN_FOX_JUMP+
                   1    CANNOT_TEACH_OLD
                   1    DOGS_NEW_TRICK+
                   1    FOX_JUMP+_OVER
                   1    OLD_DOGS_NEW
                   1    ONE_CANNOT_TEACH
                   1    OVER_THE_LAZY
                   1    QUICK_BROWN_FOX
                   1    TEACH_OLD_DOGS

        and it should be noticed that the phrases "LAZY_DOG_ONE" and
        "DOG_ONE_CANNOT" do not appear, because they span a full stop.
        Lesser punctuation, for example a comma, semicolon or colon, has
        the same effect as a full stop;  a blank line in the text is
        also treated as equivalent to punctuation, and will prevent
        phrase construction.
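
        The phrase-building rules described in this section can be
        summarised by the following Python sketch.  It is a simplified
        model, not the DSA algorithm itself:  the function
        phrase_counts is hypothetical, no truncation is performed (so
        "JUMPS" appears where DSA would print "JUMP+"), and a word is
        assumed to be any run of letters.

```python
import re
from collections import Counter


def phrase_counts(text, n, stopwords):
    """Count n-word phrases that do not cross punctuation or blank lines."""
    counts = Counter()
    # Break the text at full stops, commas, semicolons, colons and
    # blank lines, as described in the guide.
    for segment in re.split(r"[.,;:]|\n\s*\n", text):
        words = [w.upper() for w in re.findall(r"[A-Za-z]+", segment)]
        for i in range(len(words) - n + 1):
            phrase = words[i:i + n]
            # Phrases beginning or ending with a stopword are dropped;
            # embedded stopwords have no effect.
            if phrase[0] in stopwords or phrase[-1] in stopwords:
                continue
            counts["_".join(phrase)] += 1
    return counts


text = ("The quick brown fox jumps over the lazy dog.  One\n"
        "cannot teach old dogs new tricks.")
for phrase, count in sorted(phrase_counts(text, 3, {"THE"}).items()):
    print(f"{count:>12}    {phrase}")
```

        Apart from the missing truncation marks, the sketch lists the
        same nine phrases as the DSA output shown above.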



        7.  SOME OTHER COMMAND LINE SWITCHES


        By default, the frequency-ranked output does not include words
        which occur only once in the text.  Inclusion of the singleton
        words can be forced by means of a switch:

           dsa ANALYSE.1 ##FRE /A
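
        As a rough Python illustration of the frequency-ranked listing
        and the effect of the /A switch:  the functions
        frequency_ranked and format_rows are hypothetical, and both the
        alphabetical ordering of words with equal frequency and the
        column widths are assumptions, since this guide does not
        specify them.

```python
def frequency_ranked(tree, include_singletons=False):
    """Return (count, word) pairs, most frequent first."""
    rows = [(count, word) for word, count in tree.items()
            if include_singletons or count > 1]
    # Assumption: words with equal frequency are listed alphabetically.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


def format_rows(rows):
    """Lay out each row as a right-aligned count followed by the word."""
    return [f"{count:>12}    {word}" for count, word in rows]


tree = {"FOX": 3, "DOG": 1, "CAT": 3, "OWL": 2}
print("\n".join(format_rows(frequency_ranked(tree))))
print("\n".join(format_rows(frequency_ranked(tree, include_singletons=True))))
```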

        Strings beginning with a digit, such as "941225", are normally
        ignored.  A switch is available to force their recognition:

           dsa ANALYSE.1 ##FRE /B

        An important switch permits multistage statistical analysis
        by saving the data from Tree 1 prior to a frequency-ranked
        output step and then restoring it to Tree 1 afterwards.  For
        example, the command

           dsa !EXCLUDE.1 ANALYSE.1 ##FRE ANALYSE.2 #ALP /R

        reads from file EXCLUDE.1 into Tree 1, uses the resulting set
        of stopwords while reading from file ANALYSE.1 into Tree 0,
        writes frequency-ranked output to file FRE, then reads from file
        ANALYSE.2, adding to the data already in Tree 0 while again        
        excluding words previously read from file EXCLUDE.1, and finally
        writes an alphabetical word list to file ALP.  The switch "/R"
        ensures that the stopwords in Tree 1 are preserved following
        the ##FRE step.

        In the examples given hitherto, the presence of a stopword in
        Tree 1 denies this word access to Tree 0.  Sometimes it is
        useful to require that access to Tree 0 is denied only if the
        frequency associated with the word in Tree 1 is above some
        specified level.  A series of six switches provides this
        possibility.  For example, the command

            dsa !EXCLUDE.1 ANALYSE.1 ##FRE /P3

        has the effect of excluding from Tree 0 those words which have a
        relative frequency in Tree 1 higher than 0.01%.

        The switch /Q reverses the logic controlling the exclusion of
        words (or phrases) from Tree 0:  words that would have been
        excluded from Tree 0 in the absence of the /Q switch are now
        admitted, and vice versa.
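
        The combined effect of the /P and /Q switches may be sketched
        in Python as below.  The function admitted_to_tree0 is
        hypothetical;  relative frequency is assumed to mean a word's
        count divided by the total count in Tree 1, which this guide
        does not state explicitly, and the conventional stopwords of
        Tree 2 are left out of the model.

```python
# Thresholds as listed in the summary of syntax (Section 12); /P0 means
# that every word present in Tree 1 is excluded.
P_THRESHOLDS = {1: 1e-6, 2: 1e-5, 3: 1e-4, 4: 1e-3, 5: 1e-2, 6: 1e-1}


def admitted_to_tree0(word, tree1, p_level=0, reverse=False):
    """Decide whether `word` may enter Tree 0, given the Tree 1 counts."""
    count = tree1.get(word, 0)
    if p_level == 0:
        excluded = count > 0
    else:
        total = sum(tree1.values())
        # Assumption: relative frequency = word count / total count.
        relative = count / total if total else 0.0
        excluded = relative > P_THRESHOLDS[p_level]
    # /Q swaps the outcome: excluded words are admitted and vice versa.
    return excluded if reverse else not excluded


tree1 = {"COMMON": 99900, "MID": 99, "RARE": 1}
for level in (0, 3, 4):
    print(level, [w for w in tree1 if admitted_to_tree0(w, tree1, level)])
```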



        8.  MEMORY-MANAGEMENT SWITCHES


        By default, DSA uses the maximum amount of available memory.
        Although the program has procedures implementing virtual
        memory management, run-time performance inevitably deteriorates
        rapidly when disk accesses become frequent.  Normally the user
        can confidently rely on the algorithms built into DSA to utilise
        the real RAM on any particular hardware in an efficient way.
        Occasionally, however, there may be some advantage in being able
        to control the memory management manually, and various switches
        are provided for this purpose.

        As a first example, suppose Tree 1 is not required, because
        there is a single input file and no frequency ranked output is
        wanted.  By default, DSA always divides the available memory
        equally between Tree 0 and Tree 1, regardless of the fact that
        only a single filename occurs on the command line.  To avoid
        wasting half the RAM, a switch can be used:

            dsa EXCLUDE.1 #ALP /N0

        which gives all the available memory to Tree 0.  On the other
        hand, it may be advantageous to give Tree 1 more than its usual
        share of memory:

            dsa !VERY.BIG TINY.ONE ##FRE /N3

        Another important case is that in which DSA must multitask with
        other software on a machine with limited RAM.  A switch
        restricts to a minimum the memory demanded by DSA:

            dsa ANALYSE.1 #ALP /M3

        As usual, switches can be combined, permitting fine-tuning of
        the memory requirements:

            dsa ANALYSE.1 #ALP /M3 /N0

        The various memory-management switches are defined in more
        detail in Section 12 below.
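
        The arithmetic implied by the /M and /N switches can be
        illustrated with a short Python sketch.  It rests on
        assumptions:  the function memory_shares is hypothetical, the
        /M fraction is taken to be divided between Tree 0 and Tree 1 in
        the ratio given by /N, and the memory needed by Tree 2 and by
        the program itself is ignored.

```python
from fractions import Fraction


def memory_shares(m=1, n=1):
    """Return (Tree 0 share, Tree 1 share) as fractions of available memory."""
    if m not in (1, 2, 3) or n not in (0, 1, 2, 3):
        raise ValueError("expected /M1 to /M3 and /N0 to /N3")
    budget = Fraction(1, m)      # /M1 all, /M2 one half, /M3 one third
    tree0 = budget / (1 + n)     # Tree 1 receives n times the Tree 0 share
    tree1 = budget - tree0
    return tree0, tree1


print(memory_shares())           # defaults /M1 /N1
print(memory_shares(m=3, n=0))   # /M3 /N0
print(memory_shares(m=1, n=3))   # /M1 /N3
```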



        9.  IMPORT OF DATA


        As explained above, the command

           dsa FOX.TXT #ALP ##FRE

        generates two output files in a standard format.  It is
        possible to use these files as the input for a subsequent run
        of DSA, a procedure referred to as the import of data.
        Thus the command

           dsa _ALP DUCK.TXT #NEW

        causes the file ALP to be read into Tree 0.  File DUCK.TXT is
        then read, and its contents added to Tree 0.  Finally, an
        alphabetical list combining words from both ALP and DUCK.TXT
        is written to the output file NEW.  Another example is
        provided by the command

           dsa !_ALP DUCK.TXT #NEW

        which causes the words in the file ALP to be read into Tree 1,
        where they would act as stopwords for the analysis of the file
        DUCK.TXT.

        In the examples above, the imported words are added to the
        existing contents of the tree.  If the tree is to be cleared
        before importing data from a file, the following form of
        command may be used:

           dsa CAT.TXT ##FRE __ALP #NEW

        In this case, the data read from CAT.TXT into Tree 0 is discarded
        before the import from file ALP occurs.

        It is not necessary for the import files (ALP in the above
        examples) to have been generated by DSA:  the user can use any
        other convenient program to create them.  The required format of
        each record in the file is:

           decimal_number_string     character_string    CRLF

        where CRLF denotes the pair of ASCII characters signifying the
        end of a line.  The decimal number must be greater than zero
        and less than one thousand million.  The character string must
        have the form used by the DSA program in the trees:  it must
        comprise contiguous alphanumeric characters (no whitespace),
        plus optionally up to four embedded '_' characters, each
        optionally preceded by a '+' character; a '+' character is
        also optional at the very end of the string.  An input file
        illustrating both the simplest and a more complex case might be:

                     1    NMR
             999999999    NUCL+_MAGNET+_RESONAN+_IMAG+_COIL+

        The whitespace which precedes the decimal number and separates
        it from the alphanumeric string is of arbitrary length:  the
        data is read in a way which is insensitive to column position.
        The records in import files may not exceed 80 characters in
        length.
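
        Users who create import files with another program may find it
        helpful to check each record against the rules above.  The
        following Python sketch shows one possible check.  The names
        TOKEN and parse_import_record are hypothetical, and the 80
        character limit is assumed to exclude the CRLF terminator;
        DSA's own validation may differ in detail.

```python
import re

# One to five alphanumeric words joined by '_', each optionally
# followed by '+' (so a '+' may precede a '_' or end the string).
TOKEN = re.compile(r"[A-Za-z0-9]+\+?(?:_[A-Za-z0-9]+\+?){0,4}")


def parse_import_record(line):
    """Parse one import record into (count, string), or raise ValueError."""
    record = line.rstrip("\r\n")
    if len(record) > 80:
        raise ValueError("record longer than 80 characters")
    # Whitespace of arbitrary length separates the two fields.
    fields = record.split()
    if len(fields) != 2 or not re.fullmatch(r"[0-9]+", fields[0]):
        raise ValueError("expected a decimal number and one string")
    count = int(fields[0])
    if not 0 < count < 1_000_000_000:
        raise ValueError("number must be between 1 and 999999999")
    if not TOKEN.fullmatch(fields[1]):
        raise ValueError("malformed word or phrase")
    return count, fields[1]


print(parse_import_record("             1    NMR\r\n"))
print(parse_import_record(
    "     999999999    NUCL+_MAGNET+_RESONAN+_IMAG+_COIL+\r\n"))
```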



        10.  SAVE AND UNSAVE:  UNFORMATTED TEMPORARY FILES


        It is sometimes necessary to save the contents of Tree 0 or
        Tree 1 temporarily in an unformatted disk file, and to restore
        the tree from the file at a later stage in the same run of the
        DSA program.  This is done automatically when the /R switch is
        specified in order to restore the contents of Tree 1 after the
        tree has been used for generating a frequency-ranked output file
        (see Section 7 above), and users will usually not need to
        concern themselves with manipulating individual unformatted
        files.  If direct control of data transfer between the trees and
        the disk is considered essential, the required syntax is
        exemplified by the command

            dsa !EXCLUDE.1 ANALYSE.1 }!TEMP ##FRE {!TEMP 

        The contents of Tree 1 are written to the unformatted files
        TEMP.TOK and TEMP.NDX prior to the generation of the frequency-
        ranked formatted output in file FRE;  the contents of Tree 1 are
        subsequently unsaved from the same files.  Notice that the
        chosen filename (in this example TEMP) must have no extension,
        since the program DSA adds the required .TOK and .NDX in order
        to generate two separate filenames.  The above command is
        entirely equivalent to the simpler

            dsa !EXCLUDE.1 ANALYSE.1 ##FRE /R

        (In a more realistic example, the restored contents of Tree 1
        would be used as stopwords for a further statistical analysis
        step.)

        It is also possible to save the contents of Tree 0:  consider
        the (rather pointless) command formulation

            dsa ANALYSE.1 }TEMP ANALYSE.2 #ALP {TEMP ##FRE

        The result is that ALP contains an alphabetically ordered list
        of the words from both ANALYSE.1 and ANALYSE.2, whereas FRE
        contains a frequency-ranked list of words taken from ANALYSE.1
        alone.

        Unformatted files can be used in this way only during a single
        run of the DSA program.  An error exit will occur if any attempt
        is made to unsave an unformatted file which was created during
        a previous run.

        
        11.  OTHER OUTPUT FILES


        In addition to the various input and output files already
        discussed above, the DSA program generates various other files.
        There are six temporary files, three named *.TOK and three
        named *.NDX, which are deleted on exit unless there has been a
        serious run-time error.

        More interesting is the file DSA.LOG, which contains information
        on the way in which the run progressed, in particular a list of
        the files read and written.  If the /V or /VV switch has been
        specified on the command line, the logfile will also give
        technical details relating to the mechanics of the run, for
        example the amount of memory allocated.  This can be useful when
        developing a command strategy to optimise run-time speed or to
        allow unobtrusive background running.

        Finally, the program DSA will sometimes generate a file named
        DSA.OUT.  This occurs only if the user has inadvertently failed
        to specify any output filename on the command line.  For
        example, the (accidentally incomplete) command

            dsa ANALYSE.1

        will cause an alphabetical list of words to be written to the
        default output file DSA.OUT.

        ---------------------------------------------------------------


        12.  SUMMARY OF SYNTAX

        Filename prefixes:

          name.ext     Read from file name.ext into Tree 0
         !name.ext     Read from file name.ext into Tree 1
         @name.ext     Read list of filenames from name.ext
         ~name.ext     Read list of filenames from name.ext      (for GNU)
         #name.ext     Write an alphabetical wordlist        (for DOS/OS2)
        ##name.ext     Write a frequency-ranked wordlist     (for DOS/OS2)
         %name.ext     Write an alphabetical wordlist           (for UNIX)
        %%name.ext     Write a frequency-ranked wordlist        (for UNIX)
         _name.ext     Import from formatted file name.ext
        __name.ext     Import from formatted file name.ext, after clearing
         }name         Save to unformatted files name.tok and name.ndx
         {name         Unsave from unformatted files name.tok and name.ndx

        Command line switches:

        /A        export also those strings occurring only once in text
        /B        recognise strings beginning with a digit
        /C        ignore all stop words
        /DE       use German stopwords and stemming [subject to /C and /T0]
        /FR       use French stopwords and stemming [subject to /C and /T0]
        /M1       use maximum available dynamic memory [default]
        /M2       use half of available dynamic memory
        /M3       use one third of available dynamic memory
        /N0       give Tree 1 no memory
        /N1       give Tree 1 the same amount of memory as Tree 0 [default]
        /N2       give Tree 1 twice as much memory as Tree 0
        /N3       give Tree 1 three times as much memory as Tree 0
        /P0       exclude all words occurring in Tree 1 [default]
        /P1       exclude words at Tree-1 relative frequency > 0.000001
        /P2       exclude words at Tree-1 relative frequency > 0.00001
        /P3       exclude words at Tree-1 relative frequency > 0.0001
        /P4       exclude words at Tree-1 relative frequency > 0.001
        /P5       exclude words at Tree-1 relative frequency > 0.01
        /P6       exclude words at Tree-1 relative frequency > 0.1
        /Q        reverse logic for excluding words
        /R        restore contents of Tree 1 (after frequency-ranking step)
        /T0       stemming:  none
        /T1       stemming:  inbuilt algorithm [default]
        /T2       stemming:  M.F. Porter's algorithm (English only)
        /V        verbose log file
        /VV       very verbose log file
        /1        statistics of individual words [default]
        /2        statistics of 2-word phrases
        /3        statistics of 3-word phrases
        /4        statistics of 4-word phrases
        /5        statistics of 5-word phrases

        Reserved filenames:  dsa.log, dsa.out, stop.en, stop.de, stop.fr