Binglin Chang decstery at gmail.com
Fri Feb 3 11:01:46 PST 2012

Suppose there is a table "invites" with columns
  foo int
  bar string

A Hive SQL query
  SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
will be compiled to the physical query plan below. Each operator is
actually a Java class, and the operators are chained together, so the
whole plan is executed in an "interpreted" way.

  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
            alias: a
            Filter Operator
                  expr: (foo > 0)
                  type: boolean
              Select Operator
                      expr: bar
                      type: string
                outputColumnNames: bar
                Group By Operator
                        expr: count()
                  bucketGroup: false
                        expr: bar
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
                expr: count(VALUE._col0)
          bucketGroup: false
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format:
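
To make the "interpreted" execution concrete, here is a minimal C++ sketch of
an operator chain. It is only an analogy: Hive's real operators are Java
classes, and every name below (Row, Operator, FilterOperator,
CollectOperator, runInterpreted) is made up for illustration. The point is
that each row crosses a virtual call and an indirect predicate call at every
operator boundary.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type for the "invites" table (foo int, bar string).
struct Row {
    int32_t foo;
    std::string bar;
};

// Base class of an interpreted operator: it handles one row and forwards
// it to the next operator in the chain through a virtual call.
class Operator {
public:
    virtual ~Operator() = default;
    virtual void process(const Row& row) = 0;
    void setChild(Operator* child) { child_ = child; }

protected:
    void forward(const Row& row) {
        if (child_ != nullptr) child_->process(row);
    }

private:
    Operator* child_ = nullptr;
};

// Filter Operator: evaluates a predicate that was chosen at plan-build time.
class FilterOperator : public Operator {
public:
    explicit FilterOperator(std::function<bool(const Row&)> pred)
        : pred_(std::move(pred)) {}
    void process(const Row& row) override {
        if (pred_(row)) forward(row);
    }

private:
    std::function<bool(const Row&)> pred_;
};

// Terminal operator standing in for the rest of the plan: collects "bar".
class CollectOperator : public Operator {
public:
    void process(const Row& row) override { bars.push_back(row.bar); }
    std::vector<std::string> bars;
};

// Builds the chain Filter(foo > 0) -> Collect and pushes every row through it.
std::vector<std::string> runInterpreted(const std::vector<Row>& rows) {
    FilterOperator filter([](const Row& r) { return r.foo > 0; });
    CollectOperator collect;
    filter.setChild(&collect);
    for (const Row& r : rows) filter.process(r);
    return collect.bars;
}
```

Calling runInterpreted on a few rows returns the bar values of the rows whose
foo is positive, in input order.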

What I am thinking of is translating this physical query plan to LLVM IR.
The IR should inline all the operators, because they are all static, and
the input record type is known and static too. The LLVM IR would then be
compiled to native code as functions (one for the mapper, and maybe one
for the reducer). Finally, I can integrate them with the native MapReduce
runtime and run them on Hadoop.
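
As a rough illustration of the target, the C++ below is a hand-written
equivalent of what the generated mapper and reducer might compute for the
query above once all operators are inlined. It is a sketch under
assumptions, not generated code: InviteRecord, Counts, fusedMapper and
fusedReducer are hypothetical names, and the in-memory vectors stand in for
whatever the native runtime would actually pass.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical input record for "invites" (foo int, bar string).
struct InviteRecord {
    int32_t foo;
    std::string bar;
};

// Partial or final counts keyed by bar.
using Counts = std::unordered_map<std::string, int64_t>;

// Map side: Filter (foo > 0), Select (bar) and the hash-mode Group By are
// fused into one loop, with no per-operator virtual calls.
Counts fusedMapper(const std::vector<InviteRecord>& input) {
    Counts partial;
    for (const InviteRecord& rec : input) {
        if (rec.foo > 0) {
            ++partial[rec.bar];
        }
    }
    return partial;
}

// Reduce side: merges partial counts that share a key, which is the job of
// the Group By Operator in mode "mergepartial".
Counts fusedReducer(const std::vector<Counts>& partials) {
    Counts total;
    for (const Counts& p : partials) {
        for (const auto& kv : p) {
            total[kv.first] += kv.second;
        }
    }
    return total;
}
```

Running fusedMapper once per input split and passing the results to
fusedReducer gives one count per distinct bar value.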

The input data types would probably be described by some sort of schema,
or the input would just be a memory buffer laid out like a C struct.
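
One possible reading of the "memory buffer" option, sketched in C++. The
layout here (int32 foo, uint32 length, then the bytes of bar, all in host
byte order) is purely an assumption for illustration, as are the names
InviteView, appendInvite and readInvite. It is not a format used by Hive or
Hadoop.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed layout of one "invites" record inside a buffer:
//   int32_t  foo
//   uint32_t bar_len
//   char     bar[bar_len]
struct InviteView {
    int32_t foo;
    std::string bar;
};

// Appends one record to `buf` using the assumed layout.
void appendInvite(std::string& buf, int32_t foo, const std::string& bar) {
    const uint32_t len = static_cast<uint32_t>(bar.size());
    buf.append(reinterpret_cast<const char*>(&foo), sizeof(foo));
    buf.append(reinterpret_cast<const char*>(&len), sizeof(len));
    buf.append(bar);
}

// Decodes the record starting at `offset` and advances `offset` past it.
// Returns false, leaving `offset` unchanged, if the buffer is too short.
bool readInvite(const char* buf, std::size_t size, std::size_t& offset,
                InviteView& out) {
    const std::size_t header = sizeof(int32_t) + sizeof(uint32_t);
    if (offset > size || size - offset < header) return false;

    int32_t foo = 0;
    uint32_t len = 0;
    std::memcpy(&foo, buf + offset, sizeof(foo));
    std::memcpy(&len, buf + offset + sizeof(foo), sizeof(len));
    if (size - offset - header < len) return false;

    out.foo = foo;
    out.bar.assign(buf + offset + header, len);
    offset += header + len;
    return true;
}
```

Because the field offsets are fixed by the layout, generated code could read
foo and bar with plain loads instead of going through a generic
deserializer.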

I don't know if I have described this clearly. Here are some papers that
mention the technique:

[Google Tenzing] http://research.google.com/pubs/pub37200.html
[Efficiently Compiling Efficient Query Plans for Modern Hardware]


Binglin Chang

On Fri, Feb 3, 2012 at 5:21 PM, 陳韋任 <chenwj at iis.sinica.edu.tw> wrote:
> Hi Chang,
> > I am developing a Hadoop native runtime that has C++ APIs and libraries.
> > What I want to do is compile Hive's logical query plan directly to LLVM
> > IR, or translate Hive's physical query plan to LLVM IR, and then run it on
> > the Hadoop native runtime. As far as I know, Google's Tenzing does similar
> > things, and a few research papers mention this technique, but they don't
> > give details.
> > Is translating the physical query plan directly to LLVM IR reasonable, or
> > would it be better to use some part of the Clang library?
> > I need some advice on how to proceed, such as where I can find similar
> > projects or examples, or which part of the code to start reading.
>  I don't know what those query languages look like. If the query is
> turned into some kind of intermediate representation during execution
> (as a compiler does), then you might need to find which representation
> is easiest to transform into LLVM IR. Clang is for C-like languages. I
> am not sure whether Clang's libraries can help you or not.
> HTH,
> chenwj
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667


