BDD tests for Hadoop with Cucumber. Part I

Big Data is the processing of a huge amount of (more or less) structured data that has to be filtered or sorted for future analysis, or analyzed directly. There is so much data that you can’t process it on a single computer in a reasonable time. It can be logs from services, such as a call service, or logs from web servers that contain billions of records, so you have to process the data in parallel on a bunch of computers. To do so, we use specialized software based on the MapReduce pattern. In our case, we use Hadoop with the Cascading framework. Hadoop implements MapReduce, and Cascading provides a lot of useful tools, for example an abstraction over Amazon S3, so we can use S3 buckets as input/output folders for processing jobs while still being able to use regular file systems on developer machines for development and testing.
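To make the MapReduce pattern mentioned above more concrete, here is a minimal single-machine sketch in Python. It is not Hadoop or Cascading code: the function names, the log line format (HTTP status as the last field), and the use of threads in place of cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Emit a (key, 1) pair for one log line; the key is the last field (assumed HTTP status)."""
    parts = line.split()
    return [(parts[-1], 1)] if parts else []


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_by_status(lines, workers=4):
    # The map step runs in parallel; threads stand in for cluster nodes here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_phase, lines)
        pairs = [pair for chunk in mapped for pair in chunk]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample records, not real log data.
    sample = ["GET /index.html 200", "GET /missing 404", "POST /login 200"]
    print(count_by_status(sample))
```

The point of the split is that `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, so both can be spread across many machines; the framework takes care of the grouping in between.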

Like any other piece of software, it should be tested before being used in production, especially because logical mistakes are costly when processing large amounts of data. Unfortunately, there is not much information about BDD testing of Hadoop and Hive jobs. So I decided to write about how we do it in our Cascading project here at Intelliarts.

Make Code Review Useful Again

Code Review is a commonly used technique, and it’s definitely one of the must-have processes for keeping code quality high. But quite often it becomes a formality, or people run into various issues using it. So let’s take a look at what Code Review is and what good and bad parts we can get from this technique.


Code Review is used for different reasons depending on the specifics of the process. For example, in open source development it is the only way to include code contributed by third-party developers. In a team (especially a small one), mandatory review is not strictly necessary, but teams still need it. So why do people use Code Review within a team:

  1. Catch logic errors.

  2. Catch missed parts (documentation not updated, missing tests, etc.)

  3. Catch missed requirements. Sometimes a developer can accidentally overlook some of the requirements of the task being implemented.

  4. View from the outside. Someone from outside can see issues in the code that the developer misses because of being ‘bogged down in code’.

  5. Check code quality.

  6. Knowledge sharing.

  7. Education. By doing Code Review, developers educate each other.

  8. Make sure the implementation complies with coding styles, standards, etc.

Continuous deployment to AWS ECS from CircleCI

You know what it’s all about. So let’s start.

We have a website with a dockerized environment, and we want to configure automatic zero-downtime deployment on every push to the master branch of our GitHub repository.

Deploy Process

What we are going to use to do that:

  1. CircleCI as the build server
  2. GitHub as the code repository
  3. Amazon EC2 Container Service (ECS) as the production environment and our deployment target
  4. An Amazon S3 bucket to keep the secret keys used by the website