Cosmos DB command line utility

In the Big Data and Cloud space, tools evolve day in and day out, and Microsoft is playing an important role by improving its products and making sure it doesn't fall behind other players.

On the document, graph, and key-value side of things, Microsoft launched Azure Cosmos DB this year (May 2017) – based upon the former Azure DocumentDB – a globally distributed database service in the cloud.

Our problem

Time passed and we eventually crossed paths with this beauty called Cosmos DB. We don't choose every technology we work with: sometimes customers and partners run into issues with their own choices, and we're there to help.

What happened is that one day we needed to build a fast, automated, scalable data pipeline with Cosmos DB as one of the data sources, and all we had was API endpoint access to the database.

There were no easy ways or tools in the Linux space to dump data from a Cosmos DB instance the way we needed – so we built one.

Another opportunity we explored was to not only expose command-line access, but also provide a way to stream data through a Java API, enabling the bundle to be used from Java applications like Pentaho Data Integration.

Cosmos DB export

There you go: https://github.com/oncase/cosmosdb-export/.

The main features are:

  • Export data to CSV, JSON or stdout;
  • Freedom to type queries on a given collection;
  • Usage from shell on Linux, Windows or Mac;
  • Usage from Java applications;
  • Configurable page size;
  • Configurable Use Partition flag;
  • Configurable CSV delimiter, enclosure and line chars;

We were able to build this tool quite easily thanks to the existing Microsoft Azure DocumentDB Java SDK, which wraps the logic of the underlying REST API.
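
Just to give an idea of what the tool builds on, here's a minimal sketch of querying a collection directly with the DocumentDB Java SDK. It assumes the 1.x com.microsoft.azure.documentdb artifact; the endpoint, key, database/collection names and query below are placeholders for illustration, not the tool's actual code:

import com.microsoft.azure.documentdb.ConnectionPolicy;
import com.microsoft.azure.documentdb.ConsistencyLevel;
import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.FeedOptions;

public class SdkSketch {
  public static void main(String[] args) {
    // Connect to the account (placeholder endpoint and key)
    DocumentClient client = new DocumentClient(
        "https://my-documentdb.documents.azure.com:443/",
        "<master-key>",
        ConnectionPolicy.GetDefault(),
        ConsistencyLevel.Session);

    // Cross-partition queries must be enabled explicitly on partitioned collections
    FeedOptions options = new FeedOptions();
    options.setEnableCrossPartitionQuery(true);
    options.setPageSize(100); // documents fetched per round trip

    // Collection links follow the dbs/{db}/colls/{collection} convention
    String collectionLink = "dbs/my-db/colls/books-collection";

    // Run a SQL-like query and iterate lazily over the results
    for (Document doc : client.queryDocuments(collectionLink,
        "SELECT r.id, r.book_title, r.author FROM root r WHERE r.author = 'Machado de Assis'",
        options).getQueryIterable()) {
      System.out.println(doc.toString());
    }

    client.close();
  }
}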

Side note: it's funny to see how tricky it is to rebrand a big thing – the documentdb string is still present in the URL, but the content reflects Cosmos DB.

Usage

The purpose here isn't to dive into details, but I think it's interesting to show a bit of the tool. Below are two examples: one via the command line and another accessing the Java API.

Command line

This command exports to /opt/landing/incoming.csv the top 10 documents from the books-collection collection where the author is Machado de Assis.

./cosmosdb-export \
--host https://my-documentdb.documents.azure.com:443 \
--key kfaBzsrx3zUTxxZtGXGXrk5TS2XTy7yBUTE7AtyfTVDy8YfJ46dbAgH94bHHULhPhkvUkdsYc55uKFSjemJGsTpr \
--database my-db \
--collection books-collection \
--enable-partition-query \
--fields id,book_title,author,tags \
--where " WHERE r.author = 'Machado de Assis' " \
--limit 10 \
--file /opt/landing/incoming.csv
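
The --enable-partition-query flag deserves a note: the DocumentDB SDK rejects queries that span partitions unless cross-partition querying is explicitly enabled, so the flag is needed whenever the target collection is partitioned.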

Java

Here’s an example using the utility from a Java application:

/**
 * Method that connects, queries and iterates over a query
 */

public void streamData() throws EmptyDatabaseException, 
  EmptyCollectionException, DocumentClientException{

  // Setup parameters
  FlatFileExporter.params = new ExecutionParameters();
  FlatFileExporter.params.host = "https://my-documentdb.documents.azure.com:443/";
  FlatFileExporter.params.key = "kfaBzsrx3zUTxxZtGXGXrk5TS2XTy7yBUTE7AtyfTVDy8YfJ46dbAgH94bHHULhPhkvUkdsYc55uKFSjemJGsTpr";
  FlatFileExporter.params.db = "my-db";
  FlatFileExporter.params.collection = "books-collection";
  FlatFileExporter.params.enablePartitionQuery = true;
  FlatFileExporter.params.fields = "id,book_title,author,tags";
  FlatFileExporter.params.where = " WHERE r.author = 'Machado de Assis' ";
  FlatFileExporter.params.setLimit(10);

  // Reference to the fields we want to collect from each doc
  String[] fields = FlatFileExporter.params.getFieldsArray();

  // Connect and get the iterator over the query
  CollectionHandler coll = FlatFileExporter.getCollectionHandler();
  QueryIterable iterable = coll.getIterableFieldsWhere(
    FlatFileExporter.params.getLimit(), 
    FlatFileExporter.params.getQueryFields(), 
    FlatFileExporter.params.where );

  // For each block of documents
  List next = iterable.fetchNextBlock();
  while( next != null && next.size() > 0 ){
    Iterator docs = next.iterator();

    // For each document of the block
    while(docs.hasNext()){
      Document doc = (Document) docs.next();

      // Will store the fields here
      Object[] cols = new Object[fields.length];

      // Run through fields and store them
      for(int i = 0 ; i < cols.length ; i++) {
        try {
          cols[i] = getOutputVal( doc.get( fields[i] ) );
        } catch(Exception e) {
          cols[i] = "";
        }
      } 

      // Now do something with the values of the row
      putRow(cols);
    }

    // Go to next block of documents
    next = iterable.fetchNextBlock();
  }

  // Close the connection when no documents left to be read
  coll.getDocumentClient().close();
}


/**
 * Helper method to force no `null` string on output
 */
public Object getOutputVal( Object value ){
  if( value == null){
    return "";
  } else {
    return String.valueOf(value);
  }
}
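
The getOutputVal helper is a small design choice: missing or null fields come out as empty strings rather than the literal string "null", which keeps CSV and streamed output clean for downstream consumers.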

Conclusion

The point we want to share is this: in the day-to-day routine of a Business Analytics team, there should be no boundaries. Our team knows they're helping shape the future. Analytics, Big Data and Cloud tools are being built right now and we want to be part of the party. We want to take something that helped us and share it with other teams so they can improve it and speed up the ecosystem as a whole.

We're increasingly moving towards building products of our own so that, if they speed us up, we also enable the community to move faster – and then everybody starts facing new challenges, the things we're good at.

Marcello Pontes
Analytics Architect & Front-ender at Oncase