Big Data means processing huge amounts of more or less structured data that must be filtered, sorted, or otherwise prepared for analysis, or analyzed directly. There is so much data that a single computer cannot process it in a reasonable time. Typical examples are logs from services, such as a call center, or web server logs containing billions of records, so the data has to be processed in parallel on a cluster of machines. For this we use software built around the MapReduce pattern: in our case, Hadoop with the Cascading framework. Hadoop implements MapReduce, and Cascading provides many useful tools on top of it, for example an abstraction over Amazon S3. This lets us use S3 buckets as input/output locations for processing jobs, while still using the regular file system on developer machines for development and testing.
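To make the MapReduce pattern concrete, here is a minimal, framework-free sketch in Python that counts words across log records. It is purely illustrative: Hadoop runs the same three phases (map, shuffle/group, reduce) distributed across many machines, and all names below are made up for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs: one ("word", 1) pair per word in each record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's list of values into a final result.
    return {key: sum(values) for key, values in groups.items()}

logs = ["error timeout", "error disk full", "timeout"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)  # {'error': 2, 'timeout': 2, 'disk': 1, 'full': 1}
```

In a real Hadoop job the map and reduce functions run on different nodes and the shuffle happens over the network; Cascading lets you express such pipelines at a higher level instead of writing raw mappers and reducers.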
Like any other piece of software, these jobs should be tested before going to production, especially because logical mistakes are costly when you process large amounts of data. Unfortunately, there is not much information available about BDD testing of Hadoop and Hive jobs, so I decided to describe how we do it in our Cascading project here at Intelliarts.