Presto – Another Query Engine Like Apache Drill or Phoenix – Optimized for OLTP

What is Presto:

has a SQL interface to query.
connects to multiple databases including Cassandra(which Drill can’t).
a big plus – OLTP support with analytic and data warehousing capabilities.
Will not be super quick as Phoenix but if you give petabytes level huge data and complex queries, it comes back like a champion.
Facebook, Dropbox etc use it for it’s huge bid data complex analytics.(At Facebook 1000s of their employees use 300 PB scale data to run more than 30K queries everyday)
exposes interfaces/connectors to JDBC, command line and Tableou(used by data analysts)
connectors for Cassandra, Hive, MOngoDB, MySQL, Redis, Kafka etc to name a few.

Installing Presto:

We will install Presto and then make it connect to Hive and Cassandra to pull the data for us from these 2 data sources and then we can use Presto SQL Commands to process data from both the data sources.
Download the latest Presto server tar file, unzip it and change to the presto directory:
- wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.193/presto-server-0.193.tar.gz
- tar -xvf presto-server-0.193.tar.gz
- cd presto-server-0.193
We can see that in the ‘presto-server-0.193‘ directory there is no ‘etc’ folder for the configurations. Presto does not provide that in the tar file. Instead you have to follow the documentation in ‘Configuration Section at – https://prestodb.io/docs/current/installation/deployment.html or download the config from my S3 website –
- make sure you are in ‘presto-server-0.193′
- wget https://s3.amazonaws.com/testbucket786786/presto-hdp-config.tgz
- tar -xvf presto-hdp-config.tgz
- It will create the etc directory for you.
- Let us see the contents of “config.properties” files which show the server port, memory etc. – cat etc/config.properties
  - coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8090
    query.max-memory=10GB
    query.max-memory-per-node=1GB
    discovery-server.enabled=true
    discovery.uri=http://127.0.0.1:8090
- Lets also check the file – cat etc/catalog/hive.properties
  - connector.name=hive-hadoop2
    hive.metastore.uri=thrift://127.0.0.1:9083
    hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
- Finally we would need the Command Line Interface Jar file
  - move to the bin folder of ‘presto-server-0.193‘ directory – cd bin
  - download the command line interface file – wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.193/presto-cli-0.193-executable.jar
  - rename the file to presto – mv presto-cli-0.193-executable.jar presto
  - make presto file executable – chmod +x presto
- lets now run the Presto server (make sure you are in the bin directory of ‘presto-server-0.193’ – ./launcher start
- Now lets run the UI from the browser – http://127.0.0.1:8090
- The Presto UI
- Now that the Presto server is running now lets connect to Presto command line interface – ./presto –server 127.0.0.1:8090 –catalog hive –schema movielens
- In the above command you are running the presto CLI giving the parameters that server is running at 127.0.0.1:8090, you want to connect to hive and the schema is movielens
- At this stage please make sure that in Hive you already have the table ratings data in the schema movielens . If not go to this post to first create schema/database movielens and upload the ratings table – https://mohdnaeem.wordpress.com/2018/01/06/how-to-process-data-using-hadoop-hive/
- Yay!!!… you are connected to presto CLI and then you executed the follwing 2 commands to see which tables exit and the data
  - to show tables : show tables;
  - to see data : select * from ratings limit 5;
- The same can be verified from the Presto Dashboard too:
- Now lets connect Presto to Cassandra ( you can visit the post – https://mohdnaeem.wordpress.com/2018/01/15/big-data-integration-with-cassandra-using-spark/ to push users data into Cassandra, if not have done yet)
- Now you have to create Cassandra connector properties file in ‘etc/catalog’ folder in ‘presto-server-0.193‘ directory – nano cassandra.properties and add these 2 lines and save
  - connector.name=cassandra
    cassandra.contact-points=127.0.0.1
- Now move back to “bin” folder and execute –
  - ./launcher start
  - ./presto –server 127.0.0.1:8090 –catalog hive,cassandra
  - See that in catalog we are saying that we want to connect hive and Cassandra both
  - To see tables from Cassandra – show tables from cassandra.movielens;
  - To see tables from Hive- show tables from hive.movielens;
  - To describe table from Cassandra – describe cassandra.movielens.users;
  - To describe table from Hive- describe hive.movielens.ratings;
  - To see table data from Cassandra – select * from cassandra.movielens.users limit 5;
  - To see table data from Hive- select * from hive.movielens.ratings limit 5;
  - Now once we are have done these basic operations, now lets join –
    - select u.occupation, count(*) from cassandra.movielens.users u join hive.movielens.ratings r on r.user_id=u.user_id group by u.occupation
  - Yay!!!, you handled with data from 2 different data sources and processed it.
  - If you want to connect to other data sources here is the documentation for them – https://prestodb.io/docs/current/connector.html