How Model 2 Fact Tables with some common dimensions!

Hello

trying to model 2 Fact Tables with some common dimensions,

I managed to generate the model but the results of the Query are very wrong, it is like it is doing a cross join or something

is there a documentation on how to model multiple fact tables ?

thanks

dataset BIENGINETESTING
// this is an auto-generated default dataset
// please feel free to delete it, if not used
table actual budget category_dim
join actual inner join category_dim
on actual.category = category_dim.category
join budget inner join category_dim
on budget.category = category_dim.category

Hi Mim - thanks for your question. I’ve flagged with the customer support team and someone will get back to you ASAP - Jeremy

Update
Or maybe our CEO who invented our data modeling language will jump in…

Do you have a one-to-many relationship between each fact table table and the dimension table?

Fact1
id metric1
1 2
1 3

Fact2
id metric2
1 20
1 30

DimensionTable
id
1

In that case, according to standard SQL semantics, a simultaneous join is equivalent to a cross join.

Fact1 JOIN Fact1 JOIN DimensonTable
id metric1 metric2
1 2 20
1 2 30
1 3 20
1 3 30

Please let us know if our understanding is correct. Whether we are right or wrong in our interpretation of your problem, if we know your intent and how your tables are related, we can help find the correct solution.

BTW if you haven’t already done so, you can look at the auto-generated SQL by clicking on “View SQL” (near top right) when you are exploring a dataset.

here is my model

I expect the SQL to be something like this

SELECT category,
actualhours ,
budgethours ,

FROM (
SELECT category, SUM(actualhours) AS actualhours
FROM xxxxxxxx.starschema.actual

GROUP BY category
) t2
FULL OUTER JOIN (
SELECT category, SUM(budgethours) AS budgethours
FROM xxxxxxx.starschema.budget
GROUP BY category
) t3 USING(category)

We strive to provide general and powerful solutions to wide a variety of cases, while respecting basic semantics of SQL.

I understand if you expect the above SQL to be generated from your data model. Generalizing it to the case where the two JOINs use different columns may be hard. We will certainly look into this.

Meanwhile, one possible data model that would produce queries similar to your intent is as follows.

table budget
  sql xxxxxxx.starschema.budget
  datasource <yourDataSource>
  ...

table actual
  sql xxxxxxx.starschema.actual
  datasource <yourDataSource>
  ...

table joinableBudget
  sql {budget category sumBudgetHours}
  datasource <yourDataSource>
  column sumBudgetHours
    sql SUM(budgethours)
    type int
  column category
    sql ...
  ...

table joinableActual
  sql {actual category sumActualHours}
  datasource <yourDataSource>
  column sumActualHours
    sql SUM(actualhours)
    type int
  column category
    sql ...
  ...


dataset BIENGINETESTING
  table joinableBudget joinableActual
  join joinableBudget full outer join category_dim
    on joinableBudget.category = category_dim.category
  join joinableActual full outer join category_dim
    on joinableActual.category = category_dim.category

We could have defined a join rule between joinableBudget and joinableActual directly. But the above solution will work even when you select additional columns from category_dim.

The {} syntax for a table sql is documented here: Views using Structured SQL | bipp Analytics

As always, we solicit and welcome any feedback. If any of this looks unnecessarily complex, we would love to hear back from you.

1 Like

Thanks a lot, that’s very interesting approach, maybe it will be useful to show cases like this in the documentation

Yes, I completely agree. Out documentation is still far from ideal. We are working on it.

We now have a better solution for your use case in bipp.

I created tables ‘budget’, ‘actual’ and ‘category_dim’ on bigquery using your schema. Then I created a project in bipp called ‘multipleFactTables’. I connected to bigquery and added the three tables to the project. Then I created a dataset (notice the “oneToMany” keyword):

dataset multipleFactTables
  table actual budget category_dim
  join oneToMany category_dim left join actual
    on category_dim.category = actual.category
  join oneToMany category_dim left join budget
    on category_dim.category = budget.category

Then I can do what you want:
image

The underlying auto-generated query is:

SELECT t000._1 AS _0, SUM(t000._0) AS _1, SUM(t000._2) AS _2 FROM (
  SELECT t001._1 AS _0, t001._0 AS _1, SUM(t000.budgetHours) AS _2 
  FROM `test-datasources-294120.oneToMany`.`budget` AS t000 
  RIGHT OUTER JOIN (
    SELECT t001.category AS _0, SUM(t000.actualHours) AS _1 
    FROM `test-datasources-294120.oneToMany`.`actual` AS t000 
    RIGHT OUTER JOIN `test-datasources-294120.oneToMany`.`category_dim` AS t001 
    ON t001.category=t000.category 
    GROUP BY 1) AS t001 
  ON t001._0=t000.category 
  GROUP BY 1, 2) AS t000 
GROUP BY 1

Hope that helps.

1 Like

That’s great, it works as expected.
now, when I try to add another dimension, it generate an error

error : Error from makeSql: internal error: hell

I apologize for that. Can you help me figure out what the problem is? Can you share with me your bippLang data model and what columns you select?

We just deployed a version with better logging. Can you try again? Maybe that will be enough for us to debug it.

dataset starschena
table actual budget category_Dim_mv commodity_dim_mv
join oneToMany category_Dim_mv left join actual
on category_Dim_mv.category = actual.category
join oneToMany category_Dim_mv left join budget
on category_Dim_mv.category = budget.category
join oneToMany commodity_dim_mv left join actual
on commodity_dim_mv.commodity = actual.commodity
join oneToMany commodity_dim_mv left join budget
on commodity_dim_mv.commodity = budget.commodity

when trying to use two dimensions :

category_Dim_mv.category

commodity_dim_mv.commodity

and a metric from Actual

I get this error : error : Error from makeSql: internal error: assertion failure

The error message generated for you was unhelpful. But there is a genuine problem in your use case.

You are trying to simultaneously use two one-to-many relationships with the same “many” table, while choosing an aggregate from that “many” table.

category_Dim_mv -one-to-many-> actual
commodity_dim_mv -one-to-many-> actual

selected: category_Dim_mv.category, commodity_dim_mv.commodity, sum(actual.actualhours)

In general, this is ill-defined.

Consider:

  • “one” table X with dimension a with values a1 and a2.
  • “one” table Y with dimension b with values b1 and b2.
  • a common “many” table Z with measure SUM(m).

Selection of a, b and SUM(m) leads to something like:

X.a Y.b Z SUM(m)
a1 b1 ?
a1 b2 ?
a2 b1 ?
a2 b2 ?

I am not sure about the intended meaning of the third output column.

Do you agree?

I am changing the error message to:

cannot aggregate from “many” table, actual, in multiple simultaneous one-to-many relationships

Hi @mim and @angshuman.guha

Does it make sense to ignore the one-to-many relationships between the fact and dimension tables in this case.

If all aggregates are coming from the the many table (aka the fact table), we can just treat this as a 3 way join and simple group by .

Something like - Select dim1.col1, dim2.col2, SUM(fact.metric1) from fact join dim1 ON <expr> JOIN dim2 ON expr group by 1,2

Do you agree @mim ?

Sorry for the late reply, maybe a diagram will explain better, basically what’s the best way to model this two facts Tables with two dims, the measure are in both fact tables

Your use of oneToMany is correct.
We were forbidding the selection of

category_Dim_mv.category, commodity_dim_mv.commodity and sum(actual.actualhours)

We have pushed a change so that now you should be able to do that.

For your latest diagram, you can do:

join oneToMany category_dim left join budget
join oneToMany category_dim left join actual
join oneToMany contractor_dim left join budget
join oneToMany contractor_dim left join actual

You can select dimensions from contractor_dim and category_dim together with an aggregate from budget (or actual). In that case, bipp will ignore the oneToMany keyword and the results should be meaningful.

If you select dimensions from contractor_dim and category_dim together with aggregates from both budget and actual, bipp will ignore the oneToMany keyword again and the results may not be meaningful.

If you select a dimension or aggregate from contractor_dim and aggregates from both budget and actual, bipp will use its special algorithm to produce a nested SELECT.

Please let us know if we are doing the right thing in your opinion.

1 Like

I am getting this error : error : Error from makeSql: multiple ways to join: “budget” and “category_dim_mv”

That’s interesting. Exactly what columns did you choose?

this is my dataset

dataset starschena
table actual budget category_Dim_mv commodity_dim_mv
join oneToMany category_Dim_mv left join actual
on category_Dim_mv.category = actual.category
join oneToMany category_Dim_mv left join budget
on category_Dim_mv.category = budget.category
join oneToMany commodity_dim_mv left join actual
on commodity_dim_mv.commodity = actual.commodity
join oneToMany commodity_dim_mv left join budget
on commodity_dim_mv.commodity = budget.commodity

an here is my explore

Yes, of course!

Here is an excerpt from some internal documentation, as yet unpublished. Please let me know if this makes sense.


What happens if there are loops (cycles) in the graph?

Example 1:

Table t1 has a left join relationship with table t2, and an inner join relationship with table t3.
Table t2 and t3 also have an inner join relationship.

This loop is allowed to exist in a bipp dataset. But once a set of columns has been selected and the auto-SQL generator has been invoked, loops are not allowed.

For example, if a column is selected from t1 and another from t2, the auto-generated SQL will use the left join, ignoring the other two join relationships. But if a column is selected from each of the three tables, then we have a problem. To join three tables, you need two JOIN rules. In this case, any two JOIN rules out of the three would have sufficed. Since there is no a priori reason to ignore a specific JOIN rule over another, bipp’s auto-SQL generator will issue an error message.

Example 2:


If columns are chosen from t1 and t2, we have a problem (run-time error).

If columns are chosen from t1, t2 and t3, there is no problem because we can ignore the join relationships involving t4.

In order to precisely explain the underlying algorithm, we define a path to be compact if it only includes tables that are selected. For example, if t1 and t2 are selected, the path
t1 - t3 - t2
is not compact.

The auto-SQL generator will complain about loops (“multiple paths”) if the following condition is violated:

  1. either exactly one compact tree exists
  2. or there are no compact trees and exactly one non-compact tree
1 Like