Classical Incremental Refresh For Cloud Data Sources in Power BI Service (for Pro accounts)

[TL;DR]

In this post I describe, step by step, how to implement the classical incremental refresh scenario for cloud data sources in Power BI Service for Pro accounts. It is worth reading.

Foreword

As I wrote in the previous post, we can implement a semi-incremental refresh for Pro accounts in Power BI, using just dataflows.

The method I described has one main drawback: although the "historical" dataflow remains unchanged and will never load data from the source again, the "refreshing" dataflow will load the whole "fresh" part of the data over and over until you change the date intervals manually.

Initially, there is only a small amount of data in the "fresh" part…

Fresh data becomes not so fresh…

…but, after several consecutive refreshes, it can become significant, and not so fresh.

You can again split the "fresh" part in two – "new historical" and "fresh" – and so on. But this is only a SEMI-incremental refresh and, of course, not a good solution.

It seemed that implementing a complete, classic incremental refresh using just dataflows was impossible.

But, after some investigation, I found a solution that implements the classical incremental refresh scenario, where the fresh part of the data remains small and fresh, and the historical part is updated without querying the data source.

Fresh data remains fresh 🙂

At least for cloud data sources.

Warnings

  • Power BI dataflows are still in beta, so their behavior could change at any time.
  • The solution below works only for data sources which do not require an on-premises gateway. I am still working on a solution for gateways, but for now it seems impossible (or I just cannot find it). Anyway, you can use the semi-incremental approach for such data sources.

Concept (again)

The solution itself looks a little bit tricky, but it is very simple and does not require deep knowledge of the M language or other mysterious concepts.

It is based on the concept of "self-referencing tables" or "self-referencing queries" in Excel Power Query, described in various blog posts (the best are by Matt Allington and Imke Feldmann).

It can be described in a few words as follows:

  1. Create a query to the source and load its output to a worksheet table (once).
  2. Create a second query which will get the new data.
  3. Change the 1st query so that it now references its own results (the table loaded to the worksheet).
  4. Append the 2nd query to the 1st query.

So, after you refresh the 1st query, it reads the previously stored results, appends the new data to them and stores everything again in the same table on the worksheet.
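In Excel, the rewritten 1st query ends up looking roughly like the sketch below; "MainTable" and "NewData" are hypothetical names for the worksheet table holding the previous results and for the 2nd query.

let
    // Read this query's own previous output back from the worksheet table
    PreviousData = Excel.CurrentWorkbook(){[Name = "MainTable"]}[Content],
    // Append the 2nd query, which pulls only the new data from the source
    Combined = Table.Combine({PreviousData, NewData})
in
    Combined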

It works fine, but it is limited by the size of an Excel table.

Until recently this was the only "incremental refresh" we could get (and not in Power BI), because Power Query had only one ETL destination (the "smart table") that could be read back by Power Query itself.

Now, with the help of dataflows, we have a new Power Query destination: Azure Data Lake Storage Gen2. Even though it is managed by Power BI, we can reach it and read data from it. We only need to find a way to rewrite the already stored data without querying the data source again.

Solution

I will show you how it works with a sample project which queries the Yandex.Metrika API (a popular web-analytics platform in the ex-USSR).

Before we start the project, we need to decide what kind of incremental refresh we would like to do:

  • We will just add new data to the existing data set, or
  • We will add new data and also update a part of the existing data set

For this example I chose the 2nd option: we will pull stats for the last 7 days (because data could be corrected by the service), so after each daily refresh we need to replace the last 6 days of existing data and add a new day.

Below I describe the whole process in steps and screenshots.

1.      Create a dataflow

Go to the workspace where you want to create a dataflow and… just create it.

2.      Get historical data

First of all, we need to get the historical data (as much as we need, for example, up to yesterday) and store it in ADLS.

So, we need to create an entity which will query the API. In my case I created 4 entities:

  • "PQYM" function which performs an API call and pulls the data
  • "DateFrom" text parameter (to be used as a PQYM argument) – in this example just "2016-01-01"
  • "DateTo" text parameter (to be used as a PQYM argument) – in this example "yesterday"
  • "Main" query which invokes the PQYM function with the DateFrom and DateTo arguments. This query calls the API, then just renames columns and sets column data types (setting data types is necessary for dataflows).
Base queries created
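For illustration, the "Main" query amounts to something like this (a sketch only – the internals of PQYM are not shown, and the column names are hypothetical):

let
    // Call the API for the whole requested period
    Source = PQYM(DateFrom, DateTo),
    // Rename columns and set their data types (required for dataflows)
    Renamed = Table.RenameColumns(Source, {{"ym:s:date", "Date"}, {"ym:s:visits", "Visits"}}),
    Typed = Table.TransformColumnTypes(Renamed, {{"Date", type datetime}, {"Visits", Int64.Type}})
in
    Typed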

The "Enable load" option must be switched off for the function and the parameters:

Uncheck “Enable load” option for each query except Main

OK, now we must save this dataflow AND refresh it. You may have to wait half a minute or so before the dataflow finishes refreshing.

3.      Connect to the saved data

Go to Power BI Desktop and connect to "Power BI dataflows (Beta)".

Connect to Power BI dataflows

Expand the workspace, then find the dataflow you just created and expand it too. There you can see a single table (in my case named "Main"), which contains the already stored historical data.

Click "Edit" to open the Query Editor. You can see a table with the query results. Go to the column we want to use as the date filter for incremental refresh and apply a "Before…" filter on it. Just pick any nearby date – we will change it soon.

Filter it with "Before…" and some nearby date

That's all for this query. Just right-click the query name (in the left-side panel) and copy the query. You can close this file – we don't need it anymore.

Copy the query code from Desktop

We've got the code that Power BI Desktop generated to connect to the saved dataflow.
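For reference, the copied code looks roughly like this (a sketch only – the workspace and dataflow IDs are placeholders, and the "Date" column name is an assumption):

let
    Source = PowerBI.Dataflows(null),
    // Navigate to the workspace and to the dataflow (IDs are placeholders)
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
    // The "Main" entity holds the already stored historical data
    Main = Dataflow{[entity = "Main"]}[Data],
    // The "Before…" filter applied in the Query Editor (the date literal is temporary)
    FilteredRows = Table.SelectRows(Main, each [Date] < #datetime(2019, 5, 1, 0, 0, 0))
in
    FilteredRows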

Now go back to Power BI Service, open the workspace, select the dataflow we created and click the "Add entities" button.

Add new entity to the dataflow

Create a "Blank Query" and replace the code in the Advanced Editor window with the copied one.

Save this entity as "Staging" (or give it any other name that tells you it holds the historical data). You may see a lot of warnings – ignore them and, again, disable load for this entity. The warnings should go away.

This query will be used to get the SAVED data from the ADLS storage.

Then we need to provide credentials for this query – these are your Power BI account credentials.

So, we've got 2 entities (queries) in our dataflow: one obtains data and stores it in ADLS, the other reads this saved data (but does not load it anywhere else).

4.      Change the date range

What do we do with the new date range? Now the tricky part starts, so be careful.

The API which I use in this project allows dates to be set as text strings in "YYYY-MM-DD" format and also accepts keywords such as "today" and "yesterday".

So I just change the "DateTo" parameter to "yesterday", but for the "DateFrom" parameter I need to write a small piece of code:

= Date.ToText(Date.From(DateTime.FixedLocalNow() - #duration(7, 0, 0, 0)), "yyyy-MM-dd")

Set DateFrom to the date 7 days before now

This expression reads the date and time at which it is evaluated, calculates the date 7 days earlier and converts it to the desired string format.

5.      Change the historical dataset last date

Go to the "Staging" query and, in the formula bar, change the filter: instead of the #datetime(…) or #date(…) literal that we see there, write "DateTime.From(DateFrom)". Now the query that reads the saved data from ADLS keeps only the days before the fresh-data window that will be retrieved again.

Change the filter on the date column to the DateFrom parameter value
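In the formula bar, the changed filter step looks roughly like this (a sketch – "Main" is the previous step from the connection code above, and the "Date" column name is an assumption). It keeps only the rows dated before the fresh-data window, because the last 7 days will come from the "Refreshing" query:

= Table.SelectRows(Main, each [Date] < DateTime.From(DateFrom))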

6.      Create query to get fresh data

Then we need to create a query to get fresh data for the last 7 days.

As the query used to retrieve fresh data differs from "Main" only in its date range, we can just duplicate the "Main" query. Let's do so – right-click the "Main" query, select "Duplicate" and disable load for this new query too. Then rename it to "Refreshing".

As we didn’t change any code in this query, it will read the DateFrom and DateTo parameters, which we already set to the “last 7 days” range.

7.      Edit Main query

And here is the last step.

All we need to do is go to the "Main" query, open the Advanced Editor and replace all the code in its window with the following:

Final touch…
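The screenshot contains the final code. In essence, "Main" is replaced with a query that appends the fresh rows to the saved historical rows – a minimal sketch using the entity names from this post:

let
    // "Staging" reads the previously saved historical data from ADLS
    // (already filtered to dates before DateFrom);
    // "Refreshing" pulls the last 7 days from the API
    Source = Table.Combine({Staging, Refreshing})
in
    Source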

8.      Finish it

Save the dataflow and set a scheduled refresh at the pace you need – daily or otherwise.

That's all. Now you can connect to this dataflow from Power BI Desktop and do whatever you need with it.

Conclusion

So, what have we got? We have only one dataflow. In this dataflow we have:

  • a few parameters,
  • one function to get data,
  • one query which reads the saved data,
  • one query to get fresh data, and
  • one query which combines saved and fresh data and loads the result to ADLS (only this query has load enabled).

The main trick is in the 2nd step, when we performed the first call to the API and saved its result. That gave us the historical data set; then, in the 7th step, we changed this query a little. It will never read the historical part of the data from the source again – it will consume its own previous result and combine it with the fresh data.

Easy-peasy 😉

To be honest, reading this post takes much more time than performing the process itself. And writing it took much, much more time.

I hope it will help you avoid unnecessary calls to API or OData sources – or any other source which can be reached without using an on-premises gateway.

For those sources which can be reached only via gateways, I think there should be a solution – I just haven't found it yet.


Comments

  1. Hello again,

    Thank you very much for taking the time to answer my question. Actually, I believe I have another scenario and a different goal. I have a rather slow OData source (it loads around 250k rows in 7'30") and what I hoped was to improve this time by doing an incremental update. However, the time is more or less the same whether I load the whole table or use the incremental refresh.

    And I believe this behaviour happens – please correct me if I am wrong – because in step 6, when creating the code for the "Refreshing" query, it pulls all the rows from the OData source and then applies a filter in Power Query on the last 7 days. Unless I am actually missing something and you are doing something else in the function from step 1.
    What I think would improve the time would be if I could add a parametrised filter to the OData URL and get the data pre-filtered before it is loaded into PQ.

    Best regards!

    • Maxim Zelensky
      August 23, 2019 - 16:54

      I think that's the catch: you need to correctly pass date filters in your OData request when you query it for the last 7 days.
      In my example, I passed the start/end dates to a function which translates them into a correct API request.
      You need to check what notation your OData source requires to get not the full data set, but only the desired period. Then change your query accordingly.
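      For example, a pre-filtered OData query could look roughly like this (just a sketch – the URL, the "Date" property and the date format are hypothetical and depend on your source):

      let
          // Ask the OData service itself to return only the desired period
          // (BaseUrl and the "Date" property are placeholders)
          BaseUrl = "https://example.com/odata/Sales",
          Source = OData.Feed(BaseUrl & "?$filter=Date ge " & DateFrom)
      in
          Source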
      Cheers,
      Maxim

  2. Hello,

    This is a great post and a very smart approach. I was wondering what was the refresh time improvement on your part with incremental as opposed to loading the full table? I have used your method for loading an ODATA source and the refresh time for incremental is only slightly better than loading the full table.

    Thank you and best regards!

    • Maxim Zelensky
      August 21, 2019 - 12:45

      Thanks for the kind words!
      This approach requires a full re-read of the historical data from ADLS (for filtering), so it may not be as fast as desired. But the main goal is to limit the amount of data consumed from the source.
      So look at it like this:
      1. We store historical data in ADLS.
      2. On the next refresh, we have to read all the historical data from ADLS and filter it. This can be memory- (in Power BI Service) and performance-consuming, but the old data won't be downloaded from the initial source.
      So maybe the amount of historical data from your OData source is not that big, or you have a very good data transfer speed, or your OData query does not fold (for some reason) and ignores the date filter (so all the historical data is loaded again).

      Cheers,
      Maxim

  3. Graeme Bruijn
    May 22, 2019 - 11:54

    Hi Maxim,
    I have another question about the article and I was wondering if I could e-mail you? My e-mail address is XXXX and it would be amazing if I could contact you.
    Best regards,
    Graeme Bruijn

  4. Graeme Bruijn
    May 13, 2019 - 18:00

    Hi Maxim,

    Thank you for this post, looks very promising. I just wanted to make sure that the first option you gave as solution (‘We will just add new data to the existing data set, or’) is the same as the 2nd without updating the last few days of the historical data? If yes, is the procedure almost the same as you have now outlined for the 2nd option?

    Thanks a lot,
    Graeme

    • Maxim Zelensky
      May 13, 2019 - 19:23

      Hi Graeme
      Yes, you are right: the 1st and the 2nd options differ only in the time window settings for the Refreshing query (get only the fresh data, for example "yesterday") and in the filter applied to the Staging query (for example, no filter). I think the best way to implement the 1st option is to check for the last date in Staging and get the new data from that date onwards in Refreshing.
      Maxim
