Intelligent Hypothesis Generation

How to use modular data architectures to efficiently generate & test hypotheses.

Zach Wolpe
4 min read · Aug 2, 2021

Exciting problems tend to have some hierarchical depth of complexity. Intelligent problem solving, then, requires mechanisms to abstract to a suitable level of complexity, given the current requirements & available resources.

I’m a data scientist, so in my case this applies to testing theories captured in data; however, think of this as a specific instance of a broader problem-solving approach.

Here I provide a paradigm for handling complexity through modular design.

Example Problem Background

As a component of my Master’s degree I’m studying how individuals learn under uncertainty. To achieve this we’re designing a mathematical construct to examine disparities in performance across particular neurological executive functions. This sounds very fancy but, like most problems, we have a series of resources available (the data) & are asking a specific question:

How do we capture the differences in learning performance across individuals?

Machine Learning Engineering

The machine learning community is “growing up”: transitioning from fanciful (though often impractical) scientific enquiry to robust, production-ready engineering. Science is the domain of exploration whilst engineering is the domain of application; all successful technologies make this transition.

One salient manifestation of this maturation is modularisation.

In this example, that means modular software. Things have to work as modules so that they can embed in existing ecosystems. The modularisation of machine learning algorithms has largely taken place (and increasingly so). To capitalize on this:

One should design every element of their pipeline as independent modules to enable quick & seamless interchange of components.
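
To make the principle concrete, here is a hedged sketch of what “independent, interchangeable modules” can look like in code. The `Stage` and `Pipeline` names are illustrative assumptions, not part of any particular library or of our project:

```python
from typing import Any, Protocol


class Stage(Protocol):
    """Anything with a run() method can act as a pipeline component."""
    def run(self, data: Any) -> Any: ...


class Pipeline:
    """Chains independent stages; any stage can be swapped without touching the rest."""

    def __init__(self, *stages: Stage) -> None:
        self.stages = list(stages)

    def run(self, data: Any) -> Any:
        for stage in self.stages:
            data = stage.run(data)   # each module only sees the previous module's output
        return data

    def swap(self, index: int, stage: Stage) -> None:
        """Interchange one component, leaving the rest of the pipeline intact."""
        self.stages[index] = stage
```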

Solution

So what does this actually look like? The objective is to design our raw inputs as modular “blocks” that can be compiled in different ways. All we need to do is define the blocks & relationships between the blocks.

In our case, we have a series of raw text files that capture the performance of individuals across a number of neuropsychological tasks.

Our goal is to follow the standard data science process below.

Data Science Process

1. Exploratory Data Analysis (EDA)
2. Model Free Analysis
3. Theoretical Discussion
4. Model Based Analysis

This leads us to the following requirements.

Requirements

1. The ability to access the data intuitively.
2. The ability to define relationships in the data as required.
3. The ability to add functionality (methods) as required.
4. The ability to rapidly test & experiment as required.

Solution

We process the .txt files to extract the information into a series of dictionaries & dataframes. This is where we’re often tempted to stop! We have the data, so why not begin exploring it?
Instead, we write a class structure that:

Class Structure

- Hosts the raw data
- Provides a series of methods to transform & visualize the data
- Defines the relationships between the datatypes

This approach allows us to perform any calculation, transformation or transaction between any nodes in the data structure.
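
Below is a minimal sketch of what such a class structure might look like. The class name (`StudyData`), the file format, and the method names are illustrative assumptions rather than the project’s actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class StudyData:
    """One node in the data structure: hosts a task's raw data and its links."""
    name: str
    frame: pd.DataFrame                          # the raw data for this task
    links: dict = field(default_factory=dict)    # relationships to other nodes (shared keys)

    @classmethod
    def from_txt(cls, path: str, name: Optional[str] = None) -> "StudyData":
        # Assumes whitespace-delimited text with a header row.
        return cls(name or path, pd.read_csv(path, sep=r"\s+"))

    def link(self, other: "StudyData", on: str) -> None:
        """Define a relationship between two data nodes."""
        self.links[other.name] = on

    def merged(self, other: "StudyData") -> pd.DataFrame:
        """Materialise a relationship as a joined dataframe."""
        return self.frame.merge(other.frame, on=self.links[other.name])

    def summarise(self) -> pd.DataFrame:
        """A transformation method attached directly to the data."""
        return self.frame.describe()

    def plot(self, x: str, y: str):
        """A visualisation method attached directly to the data."""
        return self.frame.plot.scatter(x=x, y=y, title=self.name)
```

Each instance is a node: it hosts its raw dataframe, carries its own transformation & visualisation methods, and records its relationships to other nodes.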

Results

This hyper-modular structure allows us to process the data iteratively, swapping out different datasets over the abstracted methods.
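
As an illustration (continuing the hypothetical `StudyData` sketch above, with made-up file & column names), the same abstracted methods run unchanged over whichever dataset we swap in:

```python
for path in ["task_a.txt", "task_b.txt", "task_c.txt"]:   # illustrative file names
    node = StudyData.from_txt(path)
    print(node.summarise())                 # same method, different dataset
    node.plot(x="trial", y="accuracy")      # illustrative column names
```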

This modularisation of the data allowed me to spin up this dashboard with Plotly-Dash (Python). It covers the EDA & model-free analysis steps (sketched below) by allowing the user to:

1. Test the relationship between any variable set in the data.
2. Build intuition about potential correlations in the data.
3. Visualize & express dependencies in the data.
4. Select a colour scheme - because things should be pretty ;)
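
For a sense of how little code such a dashboard needs once the data is modular, here is a hedged Plotly-Dash sketch. The dataframe and variable names are made up; this is not the project’s actual dashboard:

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

# Hypothetical tidy dataframe of per-trial task performance.
df = pd.DataFrame({
    "trial": [1, 2, 3, 4],
    "accuracy": [0.9, 0.7, 0.8, 0.95],
    "reaction_time": [412, 530, 477, 390],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="x-var", value="trial",
                 options=[{"label": c, "value": c} for c in df.columns]),
    dcc.Dropdown(id="y-var", value="accuracy",
                 options=[{"label": c, "value": c} for c in df.columns]),
    dcc.Graph(id="scatter"),
])

@app.callback(
    Output("scatter", "figure"),
    [Input("x-var", "value"), Input("y-var", "value")],
)
def update_scatter(x, y):
    # Re-draw the scatter whenever the user picks a new variable pair.
    return px.scatter(df, x=x, y=y)

if __name__ == "__main__":
    app.run_server(debug=True)
```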

Next Steps

The purpose of this extensive groundwork is to enable robust, rapid future development. We now wield the toolkit to:

  • Automate combinatorial variable testing (sketched below)
  • Configure modular machine learning systems
  • Effectively search the space of possible models for one which sufficiently captures the data.
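
As one hedged sketch of the first point, a combinatorial scan over variable pairs might look like this (a plain correlation is used as the illustrative test statistic; any other test could be substituted):

```python
from itertools import combinations

import pandas as pd


def pairwise_scan(frame: pd.DataFrame) -> pd.DataFrame:
    """Score every pair of numeric variables & rank by absolute correlation."""
    numeric = frame.select_dtypes("number")
    rows = [
        {"x": x, "y": y, "corr": numeric[x].corr(numeric[y])}
        for x, y in combinations(numeric.columns, 2)
    ]
    return (pd.DataFrame(rows)
              .assign(abs_corr=lambda d: d["corr"].abs())
              .sort_values("abs_corr", ascending=False))
```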

Let’s see what we can do!

Code: our project is open source & the code is available on request.

Afterthought

This is very much the paradigm of modern scientific doctrine: mathematics is our best instance of modular design to date. Whilst obviously powerful, its incompleteness arises from the incongruence between emergence & reductionism. Building-block thinking aims to subsume both approaches: whilst tending toward reductionism, it allows for emergence through interactions between the nodes.
