anthe.sevenants

Connecting morphosyntax and lexical semantics with Elastic Net regression

2022-10-03

In October 2022, I started working on the FWO-funded project "Connecting morphosyntax and lexical semantics with Elastic Net regression", supervised by prof. dr. Freek Van de Velde and prof. dr. Dirk Speelman. The project proposes to use regularization methods from machine learning, specifically Elastic Net regression (and its siblings Ridge and Lasso), to investigate lexical semantic effects in morphosyntactic alternations.

Why Elastic Net regression?

Until now, corpus linguistic studies that wanted to find out whether there was a lexical semantic influence on a morphosyntactic alternation (i.e. whether specific lexemes are preferred with a specific construction) had to reduce the number of 'levels' taken into account: not all possible lexemes could be considered at the same time. With Elastic Net, we can use all possible lexemes as predictors at once, because the technique eliminates superfluous predictors and retains the interesting ones. It does this by applying shrinkage to the coefficients: the penalty combines those of Lasso and Ridge, and its Lasso component can shrink a coefficient to exactly zero, so the technique doubles as variable selection, even when the number of predictors is very large. This makes it well suited to variationist studies, where the number of predictors tends to balloon if one enters the lexemes associated with a construction into a regression model to predict constructional variants. Once the constructional "pull" of all lexemes has been determined using Elastic Net, we use distributional semantics to find lexical semantic patterns.
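To make the shrinkage idea concrete, here is a minimal sketch in plain Python. It is not the project's actual pipeline: the lexemes, the counts, the penalty settings (`lam`, `alpha`) and the fitting routine (proximal gradient descent on a logistic model) are all assumptions chosen for illustration. Each lexeme becomes one 0/1 predictor, and the outcome is which of two constructional variants was used.

```python
import math

def soft_threshold(value, threshold):
    """Move value toward zero by threshold; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net_logistic(X, y, lam=0.05, alpha=0.5, step=0.5, n_iter=1000):
    """Logistic regression with the penalty
    lam * (alpha * sum|w| + (1 - alpha) / 2 * sum(w^2)),
    fitted by proximal gradient descent. The intercept is not penalized."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Prediction error (predicted probability minus observed variant) per row.
        errors = []
        for row, target in zip(X, y):
            score = intercept + sum(w * x for w, x in zip(weights, row))
            errors.append(1.0 / (1.0 + math.exp(-score)) - target)
        intercept -= step * sum(errors) / n
        for j in range(p):
            # Gradient of the logistic loss plus the Ridge-like part of the penalty.
            gradient = sum(e * row[j] for e, row in zip(errors, X)) / n
            gradient += lam * (1.0 - alpha) * weights[j]
            # The Lasso-like part is applied by soft-thresholding.
            weights[j] = soft_threshold(weights[j] - step * gradient, step * lam * alpha)
    return weights, intercept

# Hypothetical toy data: (lexeme, variant) pairs, variant coded as 1 or 0.
lexemes = ["lexeme_a", "lexeme_b", "lexeme_c"]
observations = (
    [("lexeme_a", 1)] * 9 + [("lexeme_a", 0)] * 1
    + [("lexeme_b", 1)] * 1 + [("lexeme_b", 0)] * 9
    + [("lexeme_c", 1)] * 5 + [("lexeme_c", 0)] * 5
)
# One 0/1 predictor per lexeme.
X = [[1.0 if lexeme == name else 0.0 for name in lexemes] for lexeme, _ in observations]
y = [variant for _, variant in observations]

weights, intercept = fit_elastic_net_logistic(X, y)
pull = dict(zip(lexemes, weights))
for name, weight in pull.items():
    print(name, round(weight, 3))
```

In this toy setting, the lexeme that occurs equally often with both variants ends up with a coefficient of exactly zero, while the two lexemes with a clear preference keep coefficients of opposite sign. That is the sense in which the coefficients can be read as constructional "pull".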

In the project, we combine Elastic Net regularization with k-fold cross-validation, a standard procedure, to avoid overfitting. The Elastic Net approach mitigates several drawbacks of the alternative approaches currently used in variationist linguistics, such as random factors in mixed models and collostructional analysis. The project offers a transparent pipeline that can easily be transferred to other case studies and to other languages.
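The cross-validation step can be sketched as follows, under stated assumptions. The data are hypothetical, and the model is deliberately a simple stand-in (smoothed per-lexeme rates) rather than Elastic Net itself, so that the fold logic stays in view. The assumption here is that cross-validation is used to pick the penalty strength: each candidate value is scored only on observations the model has not seen during fitting.

```python
import math
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_smoothed_rates(lexemes, variants, penalty):
    """Stand-in model (NOT Elastic Net): the rate of variant 1 per lexeme,
    pulled toward 0.5 more strongly as penalty grows."""
    counts = {}
    for lexeme, variant in zip(lexemes, variants):
        ones, total = counts.get(lexeme, (0, 0))
        counts[lexeme] = (ones + variant, total + 1)
    return {lexeme: (ones + 0.5 * penalty) / (total + penalty)
            for lexeme, (ones, total) in counts.items()}

def log_loss(model, lexemes, variants):
    """Mean negative log-likelihood of the observed variants."""
    total = 0.0
    for lexeme, variant in zip(lexemes, variants):
        p = model.get(lexeme, 0.5)  # lexeme unseen in training: no preference
        total -= math.log(p if variant == 1 else 1.0 - p)
    return total / len(variants)

def cross_validated_loss(lexemes, variants, penalty, k=5):
    """Average held-out loss over k folds for one penalty value."""
    losses = []
    for held_out in k_fold_indices(len(variants), k):
        held = set(held_out)
        train = [i for i in range(len(variants)) if i not in held]
        model = fit_smoothed_rates([lexemes[i] for i in train],
                                   [variants[i] for i in train], penalty)
        losses.append(log_loss(model, [lexemes[i] for i in held_out],
                               [variants[i] for i in held_out]))
    return sum(losses) / k

# Hypothetical toy data, including one lexeme that occurs only once.
observations = (
    [("lexeme_a", 1)] * 9 + [("lexeme_a", 0)] * 1
    + [("lexeme_b", 1)] * 1 + [("lexeme_b", 0)] * 9
    + [("lexeme_c", 1)] * 5 + [("lexeme_c", 0)] * 5
    + [("lexeme_d", 1)]
)
lexemes = [lexeme for lexeme, _ in observations]
variants = [variant for _, variant in observations]

candidates = [0.5, 2.0, 8.0, 32.0]
cv_losses = {penalty: cross_validated_loss(lexemes, variants, penalty)
             for penalty in candidates}
best_penalty = min(cv_losses, key=cv_losses.get)
print(cv_losses, best_penalty)
```

Because every fold is held out exactly once, a penalty value that merely memorizes the training data scores poorly here, which is how cross-validation guards against overfitting.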

Presentations

Tools made for this project

Datasets from this project