Data Wrangling

Number

daw

ECTS

4.0

Specification

Importing, cleaning and transforming data

Level

Intermediate

Content

Before a model can reasonably be trained on a data set, the data must be cleaned and prepared. This so-called data wrangling should not be underestimated: it is commonly estimated to take up about 80% of a data scientist's daily work. It covers all steps from importing data from a suitable source, through cleaning the data and adding further data sources, to transforming the data into a format suitable for the desired model. Often, data processing pipelines have to be created that are adjusted continuously as modeling progresses.
In the Exploratory Data Analysis competency, students have already gained initial insight into basic data wrangling techniques with R/RStudio. This module expands on those techniques with practical examples.

Learning outcomes

Students are able to import data from commonly used file formats (text, CSV, Excel) and data formats (JSON, XML) into an appropriate data structure and are able to obtain data from relational and non-relational databases.
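
As an illustration of this outcome, the following minimal sketch reads the same kind of record from CSV text, a JSON document and a relational database into one common structure (a list of dictionaries). The sample data, the column names and the `readings` table are hypothetical, and only the Python standard library is used so that the sketch runs anywhere.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice these would be files or a real database.
CSV_TEXT = "id,city,temp_c\n1,Basel,21.5\n2,Bern,19.0\n"
JSON_TEXT = '[{"id": 3, "city": "Zurich", "temp_c": 20.1}]'


def read_csv_records(text):
    """Parse CSV text into a list of dicts, converting columns to proper types."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"id": int(r["id"]), "city": r["city"], "temp_c": float(r["temp_c"])}
        for r in rows
    ]


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def read_sql_records(conn, query):
    """Run a query and return the result rows as a list of dicts."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


# Build a small in-memory relational database to query (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, city TEXT, temp_c REAL)")
conn.execute("INSERT INTO readings VALUES (4, 'Luzern', 18.4)")

# All three sources end up in the same structure.
records = (
    read_csv_records(CSV_TEXT)
    + read_json_records(JSON_TEXT)
    + read_sql_records(conn, "SELECT id, city, temp_c FROM readings")
)
conn.close()
```

CSV delivers every field as text, so the types are converted explicitly, whereas JSON and SQLite already carry numeric types.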

Students can clean data sets: finding outliers and errors, removing duplicates, marking missing values as such or replacing them with plausible values, and defining data types appropriate to the problem. They understand the effects of data impurities on simple models and have experienced them on real data.
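
The cleaning steps named here can be sketched on a small, made-up column of weights. The data, the missing-value markers and the decision to treat the outlier as a measurement error are assumptions for illustration. The interquartile-range fence is only one of several possible outlier rules, and whether an outlier is actually an error remains a domain decision.

```python
import statistics

MISSING_MARKERS = {"NA", ""}  # assumed markers for missing values

# Hypothetical raw rows (id, weight as text); id 4 appears twice.
raw_rows = [
    (1, "11.8"), (2, "11.9"), (3, "NA"), (4, "12.0"), (4, "12.0"),
    (5, "12.1"), (6, ""), (7, "12.2"), (8, "12.3"), (9, "98.0"),
    (10, "12.4"), (11, "12.5"), (12, "12.6"),
]


def parse_weight(text):
    """Convert raw text to float, marking missing values explicitly as None."""
    return None if text.strip() in MISSING_MARKERS else float(text)


def iqr_bounds(values, k=1.5):
    """Return the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# 1. Remove exact duplicates while keeping the original order.
unique_rows = list(dict.fromkeys(raw_rows))

# 2. Give the column a proper type; missing values become None.
typed = {row_id: parse_weight(text) for row_id, text in unique_rows}

# 3. Flag outliers among the observed values.
observed = [w for w in typed.values() if w is not None]
lower, upper = iqr_bounds(observed)
outlier_ids = [
    i for i, w in typed.items() if w is not None and not lower <= w <= upper
]

# 4. Assumption: treat outliers as errors and fill all gaps with the median.
valid = [w for i, w in typed.items() if w is not None and i not in outlier_ids]
fill = statistics.median(valid)
cleaned = {
    i: (w if w is not None and i not in outlier_ids else fill)
    for i, w in typed.items()
}

# Effect of the impurity on a very simple model (the mean).
mean_with_outlier = statistics.mean(observed)
mean_without_outlier = statistics.mean(valid)
```

Comparing `mean_with_outlier` with `mean_without_outlier` shows how a single implausible value can distort even the simplest summary.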

Students understand how to transform data appropriately to answer a research question, specifically by sorting, filtering, grouping, aggregating, combining and reshaping data, and by creating derived variables that are more meaningful for the research question.
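
A minimal sketch of these transformations on hypothetical sales records follows. The field names and values are invented for illustration, and plain Python structures are used to keep each step visible.

```python
from collections import defaultdict

# Hypothetical sales records in long format: one row per store and month.
sales = [
    {"store": "A", "month": "Jan", "units": 10, "revenue": 200.0},
    {"store": "A", "month": "Feb", "units": 0, "revenue": 0.0},
    {"store": "B", "month": "Jan", "units": 5, "revenue": 150.0},
    {"store": "B", "month": "Feb", "units": 8, "revenue": 200.0},
]

# Filter: keep only rows with at least one unit sold.
sold = [r for r in sales if r["units"] > 0]

# Derive: add a unit price as a new variable (new dicts, originals untouched).
priced = [{**r, "unit_price": r["revenue"] / r["units"]} for r in sold]

# Sort: highest unit price first.
by_price = sorted(priced, key=lambda r: r["unit_price"], reverse=True)

# Group and aggregate: total revenue per store.
revenue_per_store = defaultdict(float)
for r in sales:
    revenue_per_store[r["store"]] += r["revenue"]

# Reshape: long format to wide format (one entry per store, one key per month).
wide = defaultdict(dict)
for r in sales:
    wide[r["store"]][r["month"]] = r["units"]
```

Filtering before deriving the unit price avoids a division by zero for the month without sales.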

Students are able to link external data sources with existing data in an appropriate way (joins) and can also combine records that match only approximately, using pattern matching and similarity measures (regular expressions, string distances, ...).
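
The following sketch shows one way to join two sources whose keys do not match exactly. Names are first normalized with regular expressions and then matched with a similarity ratio from the standard library's `difflib`. The company names, the list of legal suffixes and the cutoff of 0.8 are assumptions chosen for this example.

```python
import difflib
import re

# Hypothetical internal data and an external source with different spellings.
orders = [
    {"order_id": 1, "company": "ACME Corp."},
    {"order_id": 2, "company": "Globx AG"},  # contains a typo
    {"order_id": 3, "company": "Initech"},   # not in the external source
]
regions = {"Acme Corporation": "North", "Globex": "South"}

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIX = re.compile(r"\b(corp|corporation|ag|inc|ltd)\b")


def normalize(name):
    """Lowercase, drop punctuation and legal suffixes, collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    name = LEGAL_SUFFIX.sub(" ", name)
    return " ".join(name.split())


# Lookup table keyed by the normalized external names.
lookup = {normalize(name): region for name, region in regions.items()}


def match_region(company, cutoff=0.8):
    """Return the region of the most similar known company, or None."""
    key = normalize(company)
    hits = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Left join: every order is kept, unmatched orders get region None.
joined = [{**o, "region": match_region(o["company"])} for o in orders]
```

A lower cutoff finds more matches but also more false ones, so approximate joins should be checked against a sample of the results.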

Students understand how to abstract frequently reused steps into functions and how to compose these functions into efficient, readable data pipelines.
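
One possible way to express this idea is sketched below: each reusable step is a small function from rows to rows, and a helper composes the steps into a pipeline. The `pipeline` helper, the step functions and the sample data are hypothetical names introduced for this illustration.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into a single function."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def strip_names(rows):
    """Remove surrounding whitespace from the name field."""
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_missing_scores(rows):
    """Drop rows without a score."""
    return [r for r in rows if r["score"] is not None]


def add_passed(threshold):
    """Build a step that adds a boolean 'passed' column for the threshold."""
    def step(rows):
        return [{**r, "passed": r["score"] >= threshold} for r in rows]
    return step


# The pipeline reads as a list of named steps.
clean_scores = pipeline(
    strip_names,
    drop_missing_scores,
    add_passed(threshold=4.0),
)

raw = [
    {"name": " Ada ", "score": 5.5},
    {"name": "Bob", "score": None},
    {"name": "Cy", "score": 3.5},
]
result = clean_scores(raw)
```

Because every step returns new rows instead of modifying its input, steps can be tested in isolation and reordered without side effects.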

Students are able to use the scripting language R (especially the *tidyverse* packages) together with the RStudio software to create, maintain and document data processing pipelines. They know the common data types and structures and can use them appropriately.

Students are also able to implement data processing pipelines in the Python programming language (especially with the pandas package) and know the advantages and disadvantages of both approaches.
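
As a rough illustration of such a pipeline in pandas, the sketch below chains cleaning, deriving, grouping and sorting steps in a style that resembles a *tidyverse* pipe. It assumes that pandas is installed; the column names, the data and the `drop_incomplete` helper are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one missing value and one duplicate row.
raw = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "mass_g": [10.0, None, 20.0, 22.0, 22.0],
})


def drop_incomplete(df):
    """Reusable step: drop rows without a mass measurement."""
    return df.dropna(subset=["mass_g"])


summary = (
    raw
    .drop_duplicates()                               # remove exact duplicates
    .pipe(drop_incomplete)                           # apply a custom step
    .assign(mass_kg=lambda df: df["mass_g"] / 1000)  # derived variable
    .groupby("species", as_index=False)              # group
    .agg(mean_mass_kg=("mass_kg", "mean"), n=("mass_kg", "size"))
    .sort_values("mean_mass_kg", ascending=False)    # sort
    .reset_index(drop=True)
)
```

The chained calls correspond roughly to the dplyr verbs `filter`, `mutate`, `group_by`, `summarise` and `arrange`, which makes the two ecosystems easy to compare step by step.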

Evaluation

Mark

Built on the following competences

Foundation in Programming, Exploratory Data Analysis, Foundation in Databases

Module type

Portfolio Module