Load JSON Data from Cloudant into dashDB
See how to create a new dashDB instance and populate it with data directly from a Cloudant account. Note: The following video is for the dashDB managed service, and does not apply to IBM dashDB Local.
What this tutorial is about
This tutorial will show you how to create a new dashDB instance and populate it with data directly from a Cloudant account.
What you should be able to do
- Provision a new dashDB instance from Cloudant
- Populate the dashDB instance with data from Cloudant
- Work with the created tables in dashDB
- Run real-time replication from Cloudant to dashDB
- Stop real-time replication
- Delete the dashDB instance
What you need before you start
- An IBM Bluemix Account. If you don't have an account, sign up for a free account at https://bluemix.net
- An IBM Cloudant Account. If you don't have an account, you can sign up for a free account at https://cloudant.com
- At least one small database in your Cloudant account. This tutorial works with two databases from https://examples.cloudant.com called movies-demo and geo. You can either replicate those two databases into your own account or work with any other Cloudant database you already have.
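If you choose to replicate the example databases, the request can be sketched in Python as below. This is a minimal sketch, not an official procedure: it assumes the standard CouchDB-style `_replicate` endpoint that Cloudant exposes, the account URL `https://myaccount.cloudant.com` is a placeholder, and authentication is deliberately left out because it depends on how your account is set up. The helper names are hypothetical.

```python
import json
import urllib.request

# Public account that hosts the example databases named in this tutorial.
EXAMPLES_ACCOUNT = "https://examples.cloudant.com"


def build_replication_body(source_db, target_account_url):
    """Return the JSON body for a one-off replication into your own account."""
    target_base = target_account_url.rstrip("/")
    return {
        "source": f"{EXAMPLES_ACCOUNT}/{source_db}",
        "target": f"{target_base}/{source_db}",
        "create_target": True,  # create the target database if it is missing
    }


def build_replication_request(body, target_account_url):
    """Build (but do not send) the POST request for the _replicate endpoint.

    Assumption: credentials are added by the caller before sending.
    """
    return urllib.request.Request(
        url=f"{target_account_url.rstrip('/')}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder account URL; replace with your own.
    account = "https://myaccount.cloudant.com"
    request = build_replication_request(build_replication_body("geo", account), account)
    print(request.get_method(), request.full_url)
```

The request is only constructed here, so nothing is sent until you pass it to `urllib.request.urlopen` with credentials for your account. The Replication tab of the Cloudant dashboard achieves the same result without code.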
Step 1: Provision a dashDB instance from Cloudant
- Log in to your Cloudant dashboard.
- Optional: On the Databases tab, view the list of databases in your Cloudant account to find suitable database names, then open a database.
- Optional: Open and inspect individual documents in your database to understand their data and structure.
- Click the Warehousing tab.
- To create a DB2 warehouse, click Create a DB2 Warehouse and complete the form.
- Type the Warehouse Name.
- For Data Sources, type
- Next, specify the connection information; for example:
- Click Create Warehouse.
- To create a dashDB warehouse, click Create a dashDB Warehouse and complete the form.
- Type your IBM id and password, and click Authenticate in Bluemix.
- Type the Warehouse Name.
- For the Data Sources, type
geo. As you start typing, the type-ahead will list databases that match the characters you are typing.
- Select the location for the new warehouse: In this scenario, select Create a new dashDB instance.
- Check Customize Schema for both databases.
- Provision the new dashDB instance by clicking Create Warehouse.
When you click Create Warehouse, two things happen:
- The process creates a new dashDB service in your Bluemix account.
- It scans both databases to understand the document structure in each database.
- The message "Warehouse provisioned" displays in the top-left corner of the screen.
- On the Warehousing tab, click the warehouse link to open the warehouse configuration.
- Information about the new warehouse displays, such as the warehouse name, the list of source databases in the warehouse, the size of those databases, when they were last updated, and the current status (running).
- Click Customize movies-demo.
- Check/uncheck the columns to include in the warehouse.
- If this database contained multiple tables, you could deselect individual tables or entire table hierarchies to omit them from the target warehouse.
- The view also provides some statistics: the Frequency column shows the number of documents that have a value for each field.
- Change column types as you see fit.
- Change column length as you see fit.
- Click Search. The Search view lets you find columns across table boundaries. This comes in handy when you have hundreds of columns with similar names, such as CODE followed by a unique number (CODE_XXX). With search, you don't have to deselect all of those columns manually.
- Click Rescan. The Rescan option provides two different discovery algorithms and a small tuning variable. You can also increase the discovery sample from the default 10,000 documents to a higher value, which may yield better results. Cloudant databases with mixed documents especially benefit from the Cluster algorithm: it produces a separate schema for every document type in your database instead of merging them all into a single set of tables. When you are done, click Cancel or Rescan.
When you are ready to use the schema you have, click Run.
- In this scenario, you don't need to customize the schema for the geo database, so click Resume for that database to load the documents.
When you run or resume a source database, two things happen:
- The process creates tables in the new dashDB database to represent these documents.
- It copies the data from Cloudant to dashDB.
- The progress indicator shows the number of documents being copied and the color-coded progress.
- Green status indicates that Cloudant has loaded as many documents into dashDB as are currently in the source databases. From then on, real-time replication keeps dashDB current: updated documents or new document revisions automatically update the corresponding records in dashDB.
- Blue status indicates that there were problems either during the initial load or the ongoing replication.
- Click View in dashDB to launch the dashDB console.
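The schema customization in this step relies on a scan of a sample of documents, which also reports how many documents carry a value for each field. The following Python sketch illustrates that idea only. It is not the discovery algorithm that Cloudant uses: it looks at top-level fields only, the function name is hypothetical, and the default `sample_size` simply mirrors the 10,000-document default mentioned above.

```python
from collections import Counter, defaultdict


def discover_fields(documents, sample_size=10_000):
    """Count how often each top-level field occurs and which value types it holds.

    Illustrative only: nested objects and arrays are not flattened into
    separate tables here.
    """
    frequency = Counter()
    types = defaultdict(set)
    for doc in documents[:sample_size]:  # scan at most sample_size documents
        for field, value in doc.items():
            frequency[field] += 1
            types[field].add(type(value).__name__)
    return {
        field: {"frequency": frequency[field], "types": sorted(types[field])}
        for field in frequency
    }


if __name__ == "__main__":
    # Made-up documents, not taken from the example databases.
    docs = [
        {"_id": "1", "title": "A", "year": 1999},
        {"_id": "2", "title": "B"},
        {"_id": "3", "title": "C", "year": "2001"},
    ]
    for name, info in discover_fields(docs).items():
        print(name, info)
```

A field that appears with more than one type, or in only a few documents, is the kind of column you might want to retype or deselect before you click Run.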
Step 2: Work with the tables in your dashDB instance
- Click Go to your tables in the dashDB console.
- Select the Schema and then the Table Name to inspect the Table Definition and view the created database tables.
Note: Tables are created in a Schema with a name identical to the dashDB instance name. The schema is selected by default, but there are other sample schemas available in the default dashDB instance. Make sure to select the right schema to find the tables.
- Select the Browse Data tab to view the data populated into the tables.
Note: The Warehousing process may have created multiple tables for a single Cloudant database. All tables are prefixed with the capitalized database name, for example GEO_.
- Optional: Inspect the GEO_OVERFLOW table.
Note: This table is created to capture warnings and exceptions that may happen during load. There is one OVERFLOW table for every source database (for example, GEO_OVERFLOW).
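When a schema holds tables from several source databases, the naming convention described in the notes above can be used to group them. The sketch below assumes only what those notes state: table names start with the capitalized database name followed by an underscore, and each source database has one OVERFLOW table. How characters such as hyphens in a database name are mapped is not covered here, and every table name in the usage example other than GEO_OVERFLOW is hypothetical.

```python
def tables_for_database(table_names, database_name):
    """Return the tables whose names start with the capitalized database name."""
    prefix = database_name.upper() + "_"
    return sorted(name for name in table_names if name.startswith(prefix))


def overflow_table(database_name):
    """Return the name of the OVERFLOW table for a source database."""
    return database_name.upper() + "_OVERFLOW"


if __name__ == "__main__":
    # Hypothetical listing of tables in the schema.
    listing = ["GEO_OVERFLOW", "GEO_EXAMPLE_A", "OTHER_TABLE"]
    print(tables_for_database(listing, "geo"))
    print(overflow_table("geo"))
```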
Step 3: Stop Cloudant replication, rescan, or delete the dashDB instance
- Log back in to your Cloudant dashboard, and on the Warehousing tab, click View Warehouses.
- Click to open the movies-geo warehouse configuration.
- Stop the database load with the Stop action for the geo database.
Note: Because the dashDB load from Cloudant is a real-time replication, the load never stops automatically. Even if all documents were processed long ago, the Stop action is necessary to disconnect from the Cloudant changes feed. After stopping a database, you have the option to Rescan or Remove that database.
- Stop the movies-demo database.
- Click the Rescan action for the geo database.
Note: The rescan function inspects the previously discovered JSON schema and removes all tables from the dashDB instance that were created during the initial load. It then re-discovers the JSON schema, creates new tables, and ingests the Cloudant data. When you rescan, you can also choose to customize the schema or change the algorithm used during the rescan.
- Optional: Now that both databases are stopped, you can Remove the warehouse, which only removes the warehouse from the Cloudant dashboard and leaves the dashDB instance intact. Alternatively, you can also Delete the dashDB instance, which de-provisions the instance and deletes all data in it, including data that was created manually or loaded from outside Cloudant.
- Click Resume to reload both databases.
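The note in this step explains that the load stays connected to the Cloudant changes feed until you stop it. The following Python sketch shows how a consumer of such a feed might keep a set of records current. It is an illustration under stated assumptions, not a description of how the warehousing process is implemented: it assumes line-delimited JSON rows in the shape of a CouchDB-style continuous `_changes` feed requested with `include_docs=true`, and the function name and sample rows are made up.

```python
import json


def apply_changes(feed_lines, records):
    """Apply change rows to `records`, a dict keyed by document id.

    Returns the last sequence value seen, which a consumer could store
    in order to resume from that point later.
    """
    last_seq = None
    for line in feed_lines:
        line = line.strip()
        if not line:  # blank lines act as heartbeats on a continuous feed
            continue
        change = json.loads(line)
        if "id" not in change:  # for example, a closing {"last_seq": ...} row
            continue
        last_seq = change.get("seq", last_seq)
        if change.get("deleted"):
            records.pop(change["id"], None)  # drop the record for a deleted document
        elif "doc" in change:
            records[change["id"]] = change["doc"]  # insert or overwrite with the new revision
    return last_seq


if __name__ == "__main__":
    rows = [
        '{"seq": "1-a", "id": "x", "changes": [{"rev": "1-abc"}], "doc": {"_id": "x", "n": 1}}',
        "",
        '{"seq": "2-b", "id": "x", "changes": [{"rev": "2-def"}], "doc": {"_id": "x", "n": 2}}',
    ]
    table = {}
    print(apply_changes(rows, table), table)
```

Because a continuous feed has no natural end, a consumer like this keeps running after it has caught up, which is why an explicit Stop is needed before you rescan or remove a database.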