What is frequency consolidation?


To illustrate how SubString handles frequency consolidation among
different-length n-grams, let us assume we have as input the n-grams
in (1)a. These will have been extracted from a corpus, and their
frequency of occurrence in the corpus is indicated by the number
following each n-gram.


The 4-gram 'have a lovely time' occurs with a frequency of 15. The
trigrams 'have a lovely' and 'a lovely time' occur 58 and 44 times
respectively. 15 of those occurrences are, however, occurrences as part
of the superstring 'have a lovely time' (since they are substrings of
'have a lovely time'). To get the consolidated frequency of occurrence
for 'have a lovely' and 'a lovely time' (i.e. the occurrences of these
trigrams on their own, NOT counting when they occur in a longer string),
we therefore deduct the frequency of their superstring (15) from their
own frequency. This results in a consolidated frequency of 43 for 'have a
lovely' (i.e. 58 minus 15) and 29 for 'a lovely time' (i.e. 44 minus
15), as shown in (1)b.
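
The deduction step for the two trigrams can be written out as a few
lines of Python. This is only an illustration of the arithmetic above,
not SubString's own code; the variable names are made up for this
example.

```python
# Frequency of the superstring 'have a lovely time', taken from (1)a.
superstring_freq = 15

# Raw corpus frequencies of the two trigrams, taken from (1)a.
trigram_freqs = {"have a lovely": 58, "a lovely time": 44}

# Deduct the superstring's frequency from each trigram's own frequency.
consolidated = {gram: freq - superstring_freq
                for gram, freq in trigram_freqs.items()}

print(consolidated)
```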

The remaining bigrams ('have a', 'a lovely' and 'lovely time') are also
substrings of 'have a lovely time' and therefore also need to have their
frequency reduced by 15 (resulting in a frequency of 34692 for 'have a',
86 for 'a lovely' and 30 for 'lovely time'). In addition, 'have a' and 'a
lovely' are substrings of 'have a lovely' and therefore the frequency of
'have a lovely', which is now 43, needs to be deducted from their
frequencies. This results in a new frequency of 34649 for 'have a' and
43 for 'a lovely'. 'a lovely' and 'lovely time' are furthermore
substrings of 'a lovely time' and consequently need to have their
frequencies reduced by that of 'a lovely time' (i.e. by 29): the
consolidated frequency of 'a lovely' is now 14, that of 'lovely time' is
1. The output of the frequency consolidation is shown in (1)b.

(1) a   have a lovely time   15      b    have a lovely time    15
        have a lovely        58           have a lovely         43
        a lovely time        44           a lovely time         29
        have a            34707           have a             34649
        a lovely            101           a lovely              14
        lovely time          45           lovely time            1
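
The whole walk-through in (1) can be reproduced with a short Python
sketch. It rests on an assumption drawn from the example only: n-grams
are processed from longest to shortest, and each n-gram's already
consolidated frequency is deducted from every shorter n-gram it
contains as a contiguous word sequence. The function names
(consolidate, is_contiguous_sub) are hypothetical, and this is not
SubString's actual implementation, which may treat cases outside this
example differently; see the references below for the real algorithm.

```python
def is_contiguous_sub(short, long):
    """True if word tuple `short` is strictly shorter than `long`
    and occurs in it as a contiguous run of words."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def consolidate(freqs):
    """Return consolidated frequencies for a dict {n-gram string: frequency}.

    Assumption for this sketch: longest n-grams are handled first, so an
    n-gram's count is final by the time it is deducted from its substrings.
    """
    counts = {tuple(gram.split()): freq for gram, freq in freqs.items()}
    for long in sorted(counts, key=len, reverse=True):
        for short in counts:
            if is_contiguous_sub(short, long):
                counts[short] -= counts[long]
    return {" ".join(gram): freq for gram, freq in counts.items()}


# Input frequencies from (1)a.
example = {
    "have a lovely time": 15,
    "have a lovely": 58,
    "a lovely time": 44,
    "have a": 34707,
    "a lovely": 101,
    "lovely time": 45,
}

result = consolidate(example)
for gram, freq in result.items():
    print(f"{gram:<20}{freq:>6}")
```

Run on the input in (1)a, the sketch yields the figures listed in (1)b.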

A more in-depth theoretical description and justification of the
algorithm employed by SubString is found in Buerki (2017). See also
O'Donnell (2011) for a discussion of issues involved and alternative
approaches.



----------------------------------------------------------

last update: 2018-01-02