[LLVMdev] Question on LLVM based query execution
Binglin Chang
decstery at gmail.com
Fri Feb 3 11:01:46 PST 2012
Suppose there is a table "invites" with columns
foo int
bar string
A Hive SQL query
SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
will be compiled into the physical query plan below. Each operator is
actually a Java class; the operators are chained together, so the whole
plan can be executed in an "interpreted" way.
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (foo > 0)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: string
                outputColumnNames: bar
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: bar
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
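To make the "interpreted" execution concrete: this is not Hive's actual Java code, but a minimal C++ sketch of the Volcano/iterator style the plan above implies, with hypothetical Record and Operator types. Each operator pulls rows from its child through a virtual next() call, which is exactly the per-row dispatch overhead that compilation would remove.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type for the "invites" table.
struct Record { int foo; std::string bar; };

// Volcano-style pull interface: each operator asks its child for the next row.
struct Operator {
    virtual ~Operator() = default;
    virtual std::optional<Record> next() = 0;
};

struct TableScan : Operator {
    std::vector<Record> rows;
    size_t pos = 0;
    explicit TableScan(std::vector<Record> r) : rows(std::move(r)) {}
    std::optional<Record> next() override {
        if (pos >= rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Operator {  // implements WHERE foo > 0
    std::unique_ptr<Operator> child;
    explicit Filter(std::unique_ptr<Operator> c) : child(std::move(c)) {}
    std::optional<Record> next() override {
        // One virtual call per candidate row: the interpretation cost.
        while (auto r = child->next())
            if (r->foo > 0) return r;
        return std::nullopt;
    }
};

// GROUP BY bar, count(*) in hash mode: drain the child into a hash table.
std::unordered_map<std::string, long> hashAggregate(Operator& plan) {
    std::unordered_map<std::string, long> counts;
    while (auto r = plan.next()) ++counts[r->bar];
    return counts;
}
```

A caller would assemble the chain bottom-up, e.g. Filter over TableScan, then drain it with hashAggregate.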
What I am thinking of is translating this physical query plan to LLVM
IR. The IR should inline all the operators, because they are all static
and the input record type is known and static too. The LLVM IR would
then be compiled to native code as functions (one for the mapper and one
for the reducer, perhaps); finally I can integrate them with the native
MapReduce runtime and run them on Hadoop.
The input data types would probably be described by some sort of schema,
or just be a memory buffer with a layout like a C struct.
I am not sure I have described this clearly; here are some papers that mention this technique:
[Google Tenzing] http://research.google.com/pubs/pub37200.html
[Efficiently Compiling Efficient Query Plans for Modern Hardware]
www.vldb.org/pvldb/vol4/p539-neumann.pdf
Thanks,
Binglin Chang
On Fri, Feb 3, 2012 at 5:21 PM, 陳韋任 <chenwj at iis.sinica.edu.tw> wrote:
>
> Hi Chang,
>
> > I am developing a Hadoop native runtime; it has C++ APIs and libraries.
> > What I want to do is compile Hive's logical query plan directly to LLVM
> > IR, or translate Hive's physical query plan to LLVM IR, then run it on the
> > Hadoop native runtime. As far as I know, Google's Tenzing does similar
> > things, and a few research papers mention this technique, but they don't
> > give details.
> > Is translating the physical query plan directly to LLVM IR reasonable, or
> > would it be better to use some part of the Clang libraries?
> > I need some advice on how to proceed, like where I can find similar
> > projects or examples, or which part of the code to start reading.
>
> I don't know what those query languages look like. If the query language
> is turned into some kind of intermediate representation during execution
> (the way a compiler works), then you might want to find out which
> representation is easier to transform into LLVM IR. Clang is for C-like
> languages; I am not sure whether Clang's libraries can help you or not.
>
> HTH,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj