What is frequency consolidation?
To illustrate how SubString handles frequency consolidation among
different length n-grams, let us assume we have as input the n-grams
in (1)a. These will have been extracted from a corpus and their
frequency of occurrence in the corpus is indicated by the number
following each n-gram.
The 4-gram 'have a lovely time' occurs with a frequency of 15. The
trigrams 'have a lovely' and 'a lovely time' occur 58 and 44 times
respectively. 15 of those occurrences are, however, occurrences as part
of the superstring 'have a lovely time' (since they are substrings of
'have a lovely time'). To get the consolidated frequency of occurrence
for 'have a lovely' and 'a lovely time' (i.e. the occurrences of these
trigrams on their own, NOT counting when they occur in a longer string),
we therefore deduct the frequency of their superstring (15) from their
own frequency. This results in a consolidated frequency of 43 for 'have a
lovely' (i.e. 58 minus 15) and 29 for 'a lovely time' (i.e. 44 minus
15), as shown in (1)b.
The remaining bigrams ('have a', 'a lovely' and 'lovely time') are also
substrings of 'have a lovely time' and therefore also need to have their
frequency reduced by 15 (resulting in a frequency of 34692 for 'have a',
86 for 'a lovely' and 30 for 'lovely time'). In addition, 'have a' and 'a
lovely' are substrings of 'have a lovely' and therefore the frequency of
'have a lovely', which is now 43, needs to be deducted from their
frequencies. This results in a new frequency of 34649 for 'have a' and
43 for 'a lovely'. 'a lovely' and 'lovely time' are furthermore
substrings of 'a lovely time' and consequently need to have their
frequencies reduced by that of 'a lovely time' (i.e. by 29): the
consolidated frequency of 'a lovely' is now 14, that of 'lovely time' is
1. The output of the frequency consolidation is shown in (1)b.
(1)  a                             b
     have a lovely time      15    have a lovely time      15
     have a lovely           58    have a lovely           43
     a lovely time           44    a lovely time           29
     have a               34707    have a               34649
     a lovely               101    a lovely                14
     lovely time             45    lovely time              1
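The consolidation steps walked through above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not SubString's actual implementation: the function name consolidate, the dictionary input format and the choice to deduct once per position at which a substring occurs inside a superstring are all assumptions made for this example. The n-grams are processed from longest to shortest, so that the frequency of each n-gram is already consolidated by the time it is deducted from its own substrings.

```python
def consolidate(freqs):
    """Return consolidated frequencies for a dict of n-gram -> raw frequency.

    Hypothetical helper for illustration only; not part of SubString.
    """
    # Represent each n-gram as a tuple of words so substrings are word-based.
    result = {tuple(gram.split()): freq for gram, freq in freqs.items()}

    # Longest n-grams first: their counts are final before being deducted.
    for gram in sorted(result, key=len, reverse=True):
        n = len(gram)
        # Visit every shorter contiguous word sequence contained in gram.
        for size in range(n - 1, 0, -1):
            for start in range(n - size + 1):
                sub = gram[start:start + size]
                if sub in result:
                    # Assumption: deduct once per position of the substring.
                    result[sub] -= result[gram]

    return {" ".join(gram): freq for gram, freq in result.items()}


# Input frequencies taken from (1)a.
example = {
    "have a lovely time": 15,
    "have a lovely": 58,
    "a lovely time": 44,
    "have a": 34707,
    "a lovely": 101,
    "lovely time": 45,
}

for gram, freq in consolidate(example).items():
    print(gram, freq)
```

Run on the input in (1)a, the sketch reproduces the figures in (1)b: 15, 43, 29, 34649, 14 and 1.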
A more in-depth theoretical description and justification of the
algorithm employed by SubString is found in Buerki (2017). See also
O'Donnell (2011) for a discussion of issues involved and alternative
approaches.
last update: 2018-01-02