SubString, licensed under the EUPL V.1.1.

The SubString package is a cross-platform, open-source set of scripts for
substring reduction and frequency consolidation of word n-grams. It consists
of three separate command-line modules (substring-A, substring-B and an
auxiliary scripts module). Substring-A and substring-B each implement a different
algorithm for consolidating the frequencies of word n-grams of different lengths
(i.e. of different n). Frequency consolidation reduces the frequency of each
substring by the frequencies of the superstrings in which it is contained. The
output is a list showing the consolidated frequencies of all n-grams (see below
for more on frequency consolidation). The auxiliary scripts module provides a
number of additional functions for filtering n-gram lists. The functions
performed by this package will primarily be of interest to linguists and
computational linguists working on formulaic language, multi-word expressions
and other phraseological phenomena.
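
As a rough illustration of what frequency consolidation does, the following
Python sketch processes a small, made-up n-gram list from the longest n-grams
to the shortest and subtracts from each n-gram the consolidated frequencies of
the longer n-grams that contain it. This is a deliberately simplified
approximation and not the algorithm implemented by substring-A or substring-B.
The function names and the example frequencies are assumptions made for
illustration only. The sketch assumes that every occurrence of a shorter n-gram
lies inside at most one of the longest listed n-grams that contain it, which
does not hold in general.

```python
def contains(superstring, substring):
    """Return True if `substring` occurs as a contiguous run in `superstring`."""
    n = len(substring)
    return any(
        superstring[i:i + n] == substring
        for i in range(len(superstring) - n + 1)
    )


def consolidate(raw):
    """Simplified frequency consolidation (illustrative only).

    `raw` maps n-grams (tuples of words) to their raw corpus frequencies.
    N-grams are handled longest first, so the consolidated frequency of
    every superstring is known before its substrings are processed.
    """
    consolidated = {}
    for gram in sorted(raw, key=len, reverse=True):
        # Occurrences already accounted for by longer n-grams.
        absorbed = sum(
            freq
            for longer, freq in consolidated.items()
            if len(longer) > len(gram) and contains(longer, gram)
        )
        # Clamp at zero in case the simplifying assumption is violated.
        consolidated[gram] = max(raw[gram] - absorbed, 0)
    return consolidated


# Hypothetical raw frequencies, invented for this example.
raw = {
    ("have", "a", "nice", "day"): 5,
    ("have", "a", "nice"): 8,
    ("a", "nice", "day"): 6,
    ("a", "nice"): 12,
}

result = consolidate(raw)
for gram, freq in result.items():
    print(" ".join(gram), freq)
```

With these invented figures, 'have a nice day' keeps its frequency of 5, while
'have a nice' is reduced from 8 to 3, 'a nice day' from 6 to 1 and 'a nice'
from 12 to 3. The remaining counts are the occurrences not already covered by
a longer n-gram in the list.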

substring-A (algorithm with indexation)

This is a Python script designed to work in conjunction with the mwetoolkit.
It takes as input a corpus-indexed list of n-grams (as produced by mwetoolkit)
and then uses an exact, indexation-based algorithm to consolidate frequencies
of overlapping n-grams, following Altenberg and Eeg-Olofsson (1990: 16-17).

The substring-A algorithm is generally preferable to substring-B when the
source texts/corpora from which n-grams are to be extracted are accessible
and mwetoolkit can be used to extract n-grams from those texts.

substring-B (algorithm without indexation)

This is a set of Unix shell scripts designed to work on a range of simpler n-gram
list formats (where n-grams are not indexed to their occurrences in a specific
corpus). Such lists are produced by the NGramProcessor, the Ngram Statistics
Package and various other tools and sources such as Google Books Ngrams.
The algorithm implemented by substring-B is described in detail in Buerki (2017).

If access to source texts/corpora is unavailable (only n-gram lists are available)
or if the limitations of mwetoolkit mean that a different tool needs to be used
for n-gram extraction, substring-B should be used. Access to source corpora may be
unavailable in the case of sources like Google Books Ngrams, or because online
corpus portals such as the Sketch Engine are used, which allow the creation of
n-gram lists but not the download of the full underlying corpus data. N-gram lists
might require some formatting before processing; see the README file for substring-B.

auxiliary scripts

Several auxiliary scripts are included in the SubString package that allow further
processing of n-gram lists after frequency consolidation. These scripts can be used
after processing with substring-A or substring-B. See the TUTORIAL included in
SubString for details on these scripts.

What is frequency consolidation? | project page on github | about the author


last update: 2018-09-03