I am looking for advice on how best to model the data in the scenario outlined below. The core of this problem is to do with the scalability and simplicity in mind.
Consider a scenario where have three tables (this is a simplified use case). The first is an object table that contains 3 things unique object id, name and object type. The second is a feature value table that contains unique object id, feature id and feature value. The third and final table is feature table that contains unique feature id and feature name.
In terms of scale a customer may typically have 100,000s of objects with 1,000s od features. Some customers much more and some customers much less. It is likely the data is extremely sparse, by that I mean an object may only 40 or 50 feature values but, these will be spread across the many different features.
I am sure this row type structure of data storage has a name but it eludes me at the moment.
The problem is how to analyze the data according to specific features. Consider as an example the features activity type, launch date, col and weight. I want to use the structure above to report on all objects of the object type footwear and create a heat map by activity type with drill down to color. Review this by product at different points in time before such as before 2024 or after 2023.
The challenge is how to report on the feature value table that has values that could be of type date, integer, float, list of values, text, etc. For a given feature all values will be of the same type but the feature values table contains 1,000s of features of various types for 100,000s of objects. The featureValues table can be millions of rows.
Obviously I would wish to avoid creating separate tables for each feature as this has scalability and manageability issues.
What is the best approach to modeling this data to allow business analysts work with it easily in Power BI. The actual data source is not excel but JSON loaded into me MS Fabric Lakehouse and transformed using PySpark.
Thoughts and pointers on how to do this appreciated, I have a very limited scope of remodeling the data to simplify the problem if needed.
Sample data below
object
id | name | type |
A0001 | Sneaker 1 | footwear |
A0002 | Sneaker 2 | footwear |
A0003 | Sneaker 3 | footwear |
A0004 | Sneaker 4 | footwear |
A0101 | Ski Glove 1 | gloves |
A0102 | Ski Glove 2 | gloves |
featureValues
id | featureId | value |
A0001 | FID00001 | 2024-01-01 |
A0001 | FID00002 | Tan |
A0001 | FID00003 | Running |
A0001 | FID00004 | 900 |
A0002 | FID00001 | 2024-01-31 |
A0002 | FID00002 | Blue |
A0002 | FID00003 | Hiking |
A0002 | FID00004 | 1200 |
A0003 | FID00001 | 2023-12-21 |
A0003 | FID00002 | White |
A0003 | FID00003 | Tennis |
A0003 | FID00004 | 950 |
A0004 | FID00001 | 2023-03-01 |
A0004 | FID00002 | Blue |
A0004 | FID00003 | Running |
A0004 | FID00004 | 775 |
A0101 | FID00001 | 2024-01-01 |
A0101 | FID00003 | Skiing |
A0101 | FID00004 | 267 |
A0102 | FID00001 | 2023-10-01 |
A0102 | FID00003 | Sailing |
A0102 | FID00004 | 201 |
feaureNames
featureId | name |
FID00001 | Launch Date |
FID00002 | Color |
FID00003 | Activity |
FID00004 | Weight (g) |