SubString

SubString, licensed under the EUPL V.1.1.
The SubString package is a cross-platform, open-source set of scripts used for substring reduction and frequency consolidation of word n-grams. It consists of three separate command-line modules: substring-A, substring-B and an auxiliary scripts module. Substring-A and substring-B each implement a different algorithm for consolidating the frequencies of word n-grams of different lengths (i.e. of different n). Frequency consolidation reduces the frequencies of substrings by the frequencies of the superstrings in which they are contained, and produces an output list showing the consolidated frequencies of all n-grams (see below for more on frequency consolidation). The auxiliary scripts module provides a number of additional functions related to the filtering of n-gram lists. The functions performed by this package will primarily be of interest to linguists and computational linguists working on formulaic language, multi-word expressions and other phraseological phenomena.
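
As a rough illustration of the idea (not the algorithm of either module), the Python sketch below subtracts from each n-gram the frequency of every listed n-gram that is one word longer and contains it. The function names and the example counts are invented for illustration. The sketch assumes that no occurrence of a substring lies inside more than one listed superstring; real n-gram lists often break that assumption, so this should be read as a toy example only.

```python
def contains(longer, shorter):
    """Return True if the word tuple `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def consolidate_one_level(counts):
    """Subtract from each n-gram the frequencies of listed (n+1)-grams containing it.

    Assumption (illustrative only): each occurrence of a substring lies inside
    at most one listed superstring, so plain subtraction does not double count.
    """
    consolidated = {}
    for ngram, freq in counts.items():
        words = tuple(ngram.split())
        for other, other_freq in counts.items():
            other_words = tuple(other.split())
            if len(other_words) == len(words) + 1 and contains(other_words, words):
                freq -= other_freq
        consolidated[ngram] = freq
    return consolidated


# Hypothetical raw frequencies: "a lot" and "lot of" each include the
# 5 occurrences that are part of "a lot of".
raw = {"a lot of": 5, "a lot": 8, "lot of": 6}
result = consolidate_one_level(raw)
print(result)  # {'a lot of': 5, 'a lot': 3, 'lot of': 1}
```

After consolidation, "a lot" keeps only the 3 occurrences that are not part of "a lot of", and "lot of" keeps 1.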

substring-A (algorithm with indexation)

This is a Python script designed to work in conjunction with the mwetoolkit. It takes as input a corpus-indexed list of n-grams (as produced by mwetoolkit) and then uses an exact, indexation-based algorithm to consolidate the frequencies of overlapping n-grams, following Altenberg and Eeg-Olofsson (1990: 16-17). Substring-A is generally preferable to substring-B where the source texts/corpora from which n-grams are to be extracted are accessible and mwetoolkit can be used to extract n-grams from those texts.
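
To make the role of indexation concrete, here is a toy Python sketch of consolidation over n-grams that are indexed to corpus positions. It is not the substring-A script and does not read mwetoolkit's format; the function names, the frequency threshold `min_freq` and the sample sentence are assumptions made for illustration. In this sketch, an occurrence of an n-gram counts towards its consolidated frequency only if it is not fully covered by an occurrence of a longer n-gram in the list.

```python
from collections import defaultdict


def index_ngrams(tokens, max_n, min_freq):
    """Map each n-gram (2..max_n words) to its start positions, keeping frequent ones."""
    positions = defaultdict(list)
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            positions[tuple(tokens[start:start + n])].append(start)
    return {ng: pos for ng, pos in positions.items() if len(pos) >= min_freq}


def consolidate_indexed(index):
    """Count, per n-gram, the occurrences not covered by a longer listed n-gram."""
    # Every listed occurrence as a (start, end) span, end exclusive.
    spans = [(s, s + len(ng)) for ng, starts in index.items() for s in starts]
    result = {}
    for ng, starts in index.items():
        free = 0
        for s in starts:
            e = s + len(ng)
            # Covered if some strictly longer span encloses this occurrence.
            covered = any(a <= s and e <= b and (b - a) > len(ng) for a, b in spans)
            if not covered:
                free += 1
        result[" ".join(ng)] = free
    return result


# Hypothetical sample text, already tokenised into words.
tokens = "a lot of people like a lot of tea and a lot".split()
index = index_ngrams(tokens, max_n=3, min_freq=2)
consolidated = consolidate_indexed(index)
print(consolidated)  # {'a lot': 1, 'lot of': 0, 'a lot of': 2}
```

In the sample, "lot of" drops to 0 because both of its occurrences lie inside "a lot of", while "a lot" keeps the one occurrence at the end of the text that is not part of a longer listed n-gram.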

substring-B (algorithm without indexation)

This is a set of Unix shell scripts designed to work on a range of simpler n-gram list formats (where n-grams are not indexed to their occurrences in a specific corpus). Such lists are produced by the NGramProcessor, the Ngram Statistics Package and various other tools and sources such as Google Books Ngrams. The algorithm implemented by substring-B is described in detail in Buerki (2017). Substring-B should be used if access to source texts/corpora is unavailable (only n-gram lists are available) or if the limitations of mwetoolkit mean that a different tool needs to be used for n-gram extraction. Access to source corpora may be unavailable in the case of sources like Google Books Ngrams, or because online corpus portals such as the Sketch Engine are used, which allow the creation of n-gram lists but not the download of the full underlying corpus data. N-gram lists might require some formatting before processing; see the README file for substring-B.

auxiliary scripts

Several auxiliary scripts are included in the SubString package which allow the further processing of n-gram lists after frequency consolidation. These scripts can be used after processing with substring-A or substring-B. See the TUTORIAL included in SubString for details on these scripts.

What is frequency consolidation? | project page on GitHub | about the author

download current release | download previous versions
last update: 2018-09-03