For good or bad HDFS is where the data is placed today and we all know this data is hard and slow to query and analyze. The good news is that the people at Cloudera created impala, basic idea is: fast SQL on Hadoop (HDFS or HBase) using the Hive metastore.
For good or for bad Hadoop is not were Data Science happens it usually happens in R or python, good news is that the folks at cloudera made sure that is very easy to extract data out of impala using standard technologies such as ODBC or thrift.
In this post I try some tools to extract data from impala to python (pandas) to do in memory analysis.
The setup is quite simple since I am running the cloudera (CDH 4.6) in my own computer using their virtual machine, just be sure to port forward port 21050, if you are in the cloud (EC2) just be sure to open that port. The data is just a table of 100,000 rows and 3 columns in AVRO format based on the getting started with AVRO example.
pyodbc gives you access to data stores via ODBC, install it using anaconda by running
conda install pyodbc, the code is very straight forward. Note that the driver location is for OS X, change it on linux or windows.
import numpy as np import pandas as pd
connection_string = '''Driver=/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib; HOST=localhost; PORT=21050'''
connection = pyodbc.connect(connection_string, autocommit=True)
cursor = connection.cursor()
cursor.execute('SELECT * FROM avro_users LIMIT 100')
<pyodbc.Cursor at 0x1059e6330>
I took the code below from the impyla library, read more about it below.
def as_pandas(cursor): names = [metadata for metadata in cursor.description] return pd.DataFrame([dict(zip(names, row)) for row in cursor], columns=names)
df = as_pandas(cursor)
5 rows × 3 columns
"Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files."
IOPro is a library form the amazing people at continuum analytics, it is not free but you can try it for 30 days for free, I have a student license and I use it from time to time, is very good. The main idea is to load data into numpy arrays directly, it skips python objects so is very memory efficient. For ODBC they forked pyodbc and added a few extra methods to fetch data.
import iopro.pyodbc as io_pyodbc
connection = io_pyodbc.connect(connection_string, autocommit=True)
io_cursor = connection.cursor()
Using IOPro you can fetch a dictionary of numpy arrays (
fetchdictarray()) or just a numpy array (
io_cursor.execute('SELECT * FROM avro_users LIMIT 100')
<pyodbc.Cursor at 0x105ac37b0>
columns = [metadata for metadata in io_cursor.description]
data = io_cursor.fetchdictarray()
df = pd.DataFrame.from_records(data, columns=columns)
5 rows × 3 columns
You can see that there are some problems with the data, the
name column is empty, and less crucial
favorite_color is full of "NA" strings. To fix this you need to fall back the
fetchall() which creates a python list of tuples and then use that to create the pandas DataFrame, not the best approach since the idea of IOPro is go directly to numpy skipping python objects.
So for now if you want to use IOPro stick to numeric columns.
import impala.dbapi import impala.util
imp_connection = impala.dbapi.connect(host='localhost', port=21050) imp_cursor = connection.cursor() imp_cursor.execute('SELECT * FROM avro_users LIMIT 100')
<pyodbc.Cursor at 0x10a1108d0>
df = impala.util.as_pandas(imp_cursor)
5 rows × 3 columns
This is a very simple speed test reading 100,000 rows from impala. Each case includes from the connection to creating a pandas DataFrame.
%%timeit -n5 -r5 connection = pyodbc.connect(connection_string, autocommit=True) cursor = connection.cursor() cursor.execute('SELECT * FROM avro_users') data = cursor.fetchall() pd.DataFrame.from_records(data, columns=columns) cursor.close()
5 loops, best of 5: 18.2 s per loop
%%timeit -n5 -r5 connection = io_pyodbc.connect(connection_string, autocommit=True, ansi=True) io_cursor = connection.cursor() io_cursor.execute('SELECT * FROM avro_users') data = io_cursor.fetchall() pd.DataFrame.from_records(data, columns=columns) io_cursor.close()
5 loops, best of 5: 19.5 s per loop
%%timeit -n5 -r5 imp_connection = impala.dbapi.connect(host='localhost', port=21050) imp_cursor = connection.cursor() imp_cursor.execute('SELECT * FROM avro_users') impala.util.as_pandas(imp_cursor) imp_cursor.close()
5 loops, best of 5: 18.8 s per loop
Not surprisingly all of them are very similar since 1. the data is not really big (100,000 x 3) and 2. is located in a local virtual machine.
On a previous execution of the timing I got that iopro was faster than the default ODBC and that impyla was even faster, so please do not take this numbers as a complete benchmark. Also note that where IOPro excels is memory efficiency, for more info look at this blog post.
From the three choices the easier to get up and running fast is impyla since thrift is easier to install, so if you don't care that much about memory and care more about getting the data out of impala I would recommend it. ODBC gives you nice features like transaction, but transactions are not available in impala (yet) so not much of a killer feature there.
Do you have any other alternatives to extract data from Impala to python or general HDFS to python, specially to numpy arrays? Let me know. I will keep trying this technologies, specially with some real "big" data and post more results if I find something interesting.