Quantcast
Channel: Power Query topics
Viewing all articles
Browse latest Browse all 31104

Best practice to model large volumes row orientated data of multiple data types

$
0
0

I am looking for advice on how best to model the data in the scenario outlined below. The core of this problem is to do with the scalability and simplicity in mind.

 

Consider a scenario where have three tables (this is a simplified use case). The first is an object table that contains 3 things unique object id, name and object type. The second is a feature value table that contains unique object id, feature id and feature value. The third and final table is feature table that contains unique feature id and feature name.

 

situ042_0-1728314765426.png

 

In terms of scale a customer may typically have 100,000s of objects with 1,000s od features. Some customers much more and some customers much less. It is likely the data is extremely sparse, by that I mean an object may only 40 or 50 feature values but, these will be spread across the many different features.

 

I am sure this row type structure of data storage has a name but it eludes me at the moment.

 

The problem is how to analyze the data according to specific features. Consider as an example the features activity type, launch date, col and weight. I want to use the structure above to report on all objects of the object type footwear and create a heat map by activity type with drill down to color.  Review this by product at different points in time before such as before 2024 or after 2023.

 

The challenge is how to report on the feature value table that has values that could be of type date, integer, float, list of values, text, etc. For a given feature all values will be of the same type but the feature values table contains 1,000s of features of various types for 100,000s of objects. The featureValues table can be millions of rows.

 

Obviously I would wish to avoid creating separate tables for each feature as this has scalability and manageability issues.

 

What is the best approach to modeling this data to allow business analysts work with it easily in Power BI. The actual data source is not excel but JSON loaded into me MS Fabric Lakehouse and transformed using PySpark.

 

Thoughts and pointers on how to do this appreciated, I have a very limited scope of remodeling the data to simplify the problem if needed.

 

Sample data below

 

object

idnametype
A0001Sneaker 1footwear
A0002Sneaker 2footwear
A0003Sneaker 3footwear
A0004Sneaker 4footwear
A0101Ski Glove 1gloves
A0102Ski Glove 2gloves

 

featureValues

idfeatureIdvalue
A0001FID000012024-01-01
A0001FID00002Tan
A0001FID00003Running
A0001FID00004900
A0002FID000012024-01-31
A0002FID00002Blue
A0002FID00003Hiking
A0002FID000041200
A0003FID000012023-12-21
A0003FID00002White
A0003FID00003Tennis
A0003FID00004950
A0004FID000012023-03-01
A0004FID00002Blue
A0004FID00003Running
A0004FID00004775
A0101FID000012024-01-01
A0101FID00003Skiing
A0101FID00004267
A0102FID000012023-10-01
A0102FID00003Sailing
A0102FID00004201

 

feaureNames

featureIdname
FID00001Launch Date
FID00002Color
FID00003Activity
FID00004Weight (g)

Viewing all articles
Browse latest Browse all 31104

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>