Hello
Power BI seems to work in a way that means it cannot get the best out of Azure SQL Data Warehouse - more specifically Power BI seems likely to generate unnecessary data movement between the compute nodes. I want to outline my use case, check my understanding and see if anyone else has any comments/thoughts…
Example: I have the following (subset of) tables containing web analytics data from an e-Commerce solution:
Session
PageView
PageEvent
(One session has multiple page views, and each page view has multiple events).
Session joins to PageView on SessionID
PageView joins to PageEvent on PageViewID
In Azure SQL DW, all three tables contain the SessionID and all three tables are Hash distributed on this field.
This means all the data for a particular session resides within one compute node in the Azure SQL DWH.
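For anyone wanting to reproduce this, the distribution setup looks roughly like the sketch below. Only SessionID, PageViewID and the hash distribution come from my actual schema; the data types and the other column names are placeholders I have made up for illustration.

```sql
-- Illustrative sketch only: data types and any column other than
-- SessionID / PageViewID are assumptions, not the real schema.
CREATE TABLE dbo.[Session]
(
    SessionID    BIGINT    NOT NULL,
    SessionStart DATETIME2 NOT NULL          -- hypothetical column
)
WITH (DISTRIBUTION = HASH(SessionID), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.PageView
(
    PageViewID   BIGINT    NOT NULL,
    SessionID    BIGINT    NOT NULL,         -- carried down so the table can share the distribution key
    PageUrl      NVARCHAR(2000) NULL         -- hypothetical column
)
WITH (DISTRIBUTION = HASH(SessionID), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.PageEvent
(
    PageEventID  BIGINT    NOT NULL,         -- hypothetical column
    PageViewID   BIGINT    NOT NULL,
    SessionID    BIGINT    NOT NULL,         -- carried down so the table can share the distribution key
    EventName    NVARCHAR(200) NULL          -- hypothetical column
)
WITH (DISTRIBUTION = HASH(SessionID), CLUSTERED COLUMNSTORE INDEX);
```

Because all three tables hash on the same column (with the same data type), rows with the same SessionID should end up in the same distribution.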
This means hand-written analytical queries typically involve no data movement between compute nodes (other than to return the final results) as they are written roughly as follows:
SELECT ...
FROM Session s
INNER JOIN PageView pv
ON s.SessionID = pv.SessionID
INNER JOIN PageEvent pe
ON pv.PageViewID = pe.PageViewID and
pv.SessionID = pe.SessionID
…
When Power BI is connected to Azure SQL DWH via DirectQuery, it presumably won't generate the final join condition in the query above (pv.SessionID = pe.SessionID), since it only supports relationships on a single field.
This looks like it will lead to additional data movement between the compute nodes of an Azure SQL DWH instance.
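To be clear about what I mean, this is the shape of query I would expect from a single-field relationship. It is my hand-written approximation, not SQL captured from Power BI, and the COUNT(*) is just a placeholder select list.

```sql
-- Approximation of a join on PageViewID only (assumed shape, not captured from Power BI).
-- Table names are schema-qualified here on the assumption that they live in dbo.
SELECT COUNT(*) AS EventCount          -- placeholder select list
FROM dbo.[Session] s
INNER JOIN dbo.PageView pv
    ON s.SessionID = pv.SessionID
INNER JOIN dbo.PageEvent pe
    ON pv.PageViewID = pe.PageViewID;  -- no predicate on the distribution column for this join
```

Without a predicate on SessionID between PageView and PageEvent, the optimizer cannot tell from the join condition alone that matching rows are already co-located.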
I have investigated this by running a query (from Management Studio) similar to the above with and without the last line, and reviewing the steps in the query execution from sys.dm_pdw_request_steps.
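The step listings can be pulled with something like the following. This is a sketch rather than the exact statements I ran: the label text is a made-up example and the table names are assumed to be in dbo. Tagging the test query with OPTION (LABEL = ...) just makes it easier to find in sys.dm_pdw_exec_requests.

```sql
-- 1. Run the test query with a label (label text is a hypothetical example).
SELECT COUNT(*) AS EventCount          -- placeholder select list
FROM dbo.[Session] s
INNER JOIN dbo.PageView pv
    ON s.SessionID = pv.SessionID
INNER JOIN dbo.PageEvent pe
    ON pv.PageViewID = pe.PageViewID
   AND pv.SessionID  = pe.SessionID    -- remove this line for the second test
OPTION (LABEL = 'pbi_join_test');

-- 2. List the execution steps for the most recent request carrying that label.
SELECT st.step_index,
       st.operation_type,
       st.distribution_type,
       st.location_type
FROM sys.dm_pdw_request_steps AS st
WHERE st.request_id = (SELECT TOP 1 r.request_id
                       FROM sys.dm_pdw_exec_requests AS r
                       WHERE r.[label] = 'pbi_join_test'
                       ORDER BY r.submit_time DESC)
ORDER BY st.step_index;
```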
With the “pv.SessionID = pe.SessionID” line:
step_index  operation_type          distribution_type  location_type
0           OnOperation             Unspecified        Control
1           PartitionMoveOperation  Unspecified        DMS
2           ReturnOperation         Unspecified        Control
3           OnOperation             Unspecified        Control
Without the “pv.SessionID = pe.SessionID” line:
step_index  operation_type          distribution_type  location_type
0           RandomIDOperation       Unspecified        Control
1           OnOperation             AllComputeNodes    Compute
2           BroadcastMoveOperation  Unspecified        DMS
3           OnOperation             Unspecified        Control
4           PartitionMoveOperation  Unspecified        DMS
5           OnOperation             AllComputeNodes    Compute
6           ReturnOperation         Unspecified        Control
7           OnOperation             Unspecified        Control
This additional join criterion is important since it ensures that the join can be resolved within each compute node, so no data movement between nodes is needed until returning results. Without it, the plan gains a BroadcastMoveOperation and extra compute steps.
This seems to make Power BI less useful as a client to generally browse an Azure SQL Data Warehouse. Of course, it is still possible to write specific queries using the additional join criteria, but then that isn’t using Power BI as a general DWH client.