Introduction to Big Data Testing, How to handle Big Data?, SQL Databases vs. NoSQL Databases, testing structured and unstructured databases.

What is Big Data?
Big Data is a term used to describe a collection of data that is huge in size and yet keeps growing exponentially with time.
Sources that generate Big Data include stock exchanges, social media sites, jet engines, etc.
Data formats in Big Data
1. Structured Data
This refers to data that is highly organized. It can be easily stored in a relational database, which also means that it can be easily retrieved or searched using simple queries.
Ex: Data in table (columns and rows) format.
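To illustrate why structured data is easy to search, here is a minimal sketch using Python's built-in sqlite3 module. The products table, its columns, and the sample rows are made up for this example and are not part of any real system.

```python
import sqlite3

def find_products_under(conn, max_price):
    """Return (name, price) rows cheaper than max_price, cheapest first."""
    cur = conn.execute(
        "SELECT name, price FROM products WHERE price < ? ORDER BY price",
        (max_price,),
    )
    return cur.fetchall()

# In-memory database with a hypothetical products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("pen", 1.5), ("notebook", 4.0), ("backpack", 35.0)],
)

# Because every row has the same columns, one simple query is enough.
cheap = find_products_under(conn, 10)
print(cheap)
```

The fixed schema is what makes the single SELECT statement sufficient: every row is guaranteed to have a name and a price.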
2. Semi-Structured Data
Semi-structured data is not organized rigidly enough to be accessed and searched as easily as structured data.
It is not usually stored in a relational database, but it can contain tags and other metadata that give it a hierarchy and order.
Examples of Semi-Structured Data
CSV, XML and JavaScript Object Notation (JSON)
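As a small illustration of how tags give semi-structured data a hierarchy without a fixed schema, the sketch below parses a JSON document with Python's standard json module. The order record and its field names are hypothetical. Note that the second item carries an extra gift_wrap field that the first item lacks, which a relational table would not allow without a schema change.

```python
import json

# A hypothetical order record; the two items do not share the same fields.
raw = '''
{"order_id": 101,
 "customer": {"name": "Asha", "city": "Pune"},
 "items": [{"sku": "A1", "qty": 2},
           {"sku": "B7", "qty": 1, "gift_wrap": true}]}
'''

def total_quantity(order_json):
    """Sum the qty field over all items, tolerating missing fields."""
    order = json.loads(order_json)
    return sum(item.get("qty", 0) for item in order.get("items", []))

order = json.loads(raw)
city = order["customer"]["city"]   # navigate the hierarchy by key
qty = total_quantity(raw)
print(city, qty)
```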
3. Unstructured Data
Unstructured data does not have a predefined format and does not follow a structured data model.
Examples of Unstructured Data
Images, videos, Word documents, presentations, MP3 files, etc.
Examples and Usage Of Big Data
E-commerce applications
Amazon, Flipkart and other e-commerce sites have millions of visitors each day and hundreds of thousands of products. They use Big Data to store information about products, customers and purchases.
Social Media applications
Social media sites such as Facebook, Twitter and Instagram generate huge amounts of data in the form of pictures, videos, likes, posts, comments, etc.
Healthcare applications
Stock Market applications
Two types of Databases for storing the data
Relational or SQL Databases: MS Access, MS SQL Server, Oracle, MySQL, Sybase, DB2, DB/400, etc.
NoSQL Databases: MongoDB, CouchDB, Couchbase, Cassandra, HBase, Redis, etc.
What is Big Data Testing?
Big Data testing is the testing of a big data application to ensure that all of its functionalities work as expected.
The general approach to testing a Big Data application involves the following stages.
1. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources.
The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or other sources such as log files, documents, social media feeds, etc.
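One common check at the ingestion stage is to reconcile what was extracted against what the source holds. The sketch below is a minimal, tool-independent version of that idea: the reconcile function, the id key, and the sample rows are assumptions made for illustration, and in a real project the two lists would come from the source system and the ingestion target.

```python
def reconcile(source_rows, ingested_rows, key="id"):
    """Compare record counts and keys between the source and the ingested data."""
    source_keys = {row[key] for row in source_rows}
    ingested_keys = {row[key] for row in ingested_rows}
    return {
        "source_count": len(source_rows),
        "ingested_count": len(ingested_rows),
        "missing": sorted(source_keys - ingested_keys),      # dropped during ingestion
        "unexpected": sorted(ingested_keys - source_keys),   # not present in the source
    }

# Hypothetical sample data: record 2 was lost during ingestion.
source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
ingested = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}]

report = reconcile(source, ingested)
print(report)
```

A test would typically fail when either the missing or the unexpected list is non-empty.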
2. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS (Hadoop Distributed File System) or in a NoSQL database.
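Once the extracted data has been stored, a tester may want to confirm that each record was written without being altered. The following sketch compares per-record checksums using Python's hashlib. The function names and sample records are hypothetical, and reading the records back from HDFS or a NoSQL database is left out, since that depends on the client library in use.

```python
import hashlib
import json

def record_checksum(record):
    """Hash a record in a canonical form so that key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_corrupted(extracted, stored, key="id"):
    """Return keys of records that are missing from storage or differ from the extract."""
    stored_by_key = {rec[key]: rec for rec in stored}
    bad = []
    for rec in extracted:
        match = stored_by_key.get(rec[key])
        if match is None or record_checksum(match) != record_checksum(rec):
            bad.append(rec[key])
    return bad

# Hypothetical sample data: record 2 was stored with a changed amount.
extracted = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
stored = [{"amount": 10, "id": 1}, {"id": 2, "amount": 25}]

corrupted = find_corrupted(extracted, stored)
print(corrupted)
```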
3. Data Processing and Validation of the Output
The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc., and the resulting output is then validated.
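One way to validate the output of a processing job is to recompute the expected result independently on a small input and compare the two. The sketch below does this for a per-region sum in plain Python. The job_output dictionary stands in for whatever a Spark, MapReduce or Pig job would actually produce, and the field names are assumptions for illustration only.

```python
from collections import defaultdict

def expected_totals(records):
    """Independently recompute the per-region totals from the raw input."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)

def diff_output(expected, actual, tolerance=1e-9):
    """Return {key: (expected, actual)} for every key that is missing or differs."""
    mismatches = {}
    for k in expected.keys() | actual.keys():
        e, a = expected.get(k), actual.get(k)
        if e is None or a is None or abs(e - a) > tolerance:
            mismatches[k] = (e, a)
    return mismatches

# Hypothetical input and job output: the "south" total is wrong.
records = [
    {"region": "north", "amount": 10.0},
    {"region": "north", "amount": 5.0},
    {"region": "south", "amount": 7.5},
]
job_output = {"north": 15.0, "south": 7.0}

mismatches = diff_output(expected_totals(records), job_output)
print(mismatches)
```

The tolerance parameter is there because distributed jobs may add floating-point values in a different order than the reference computation.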
Big Data Testing Strategy
There are various types of testing in Big Data projects, such as Database testing, Functional testing, Infrastructure testing, and Performance testing.
In Big Data testing, QA engineers verify the successful processing of terabytes of data using commodity clusters and other supporting components.
Subsets of Big Data Testing
Data Ingestion Testing
Data Storage Testing
Data Processing Testing
Data Migration Testing
Big Data Tools For Data Analysis
Xplenty
Apache Hadoop
CDH (Cloudera Distribution for Hadoop)
Apache Cassandra
Knime
Datawrapper
MongoDB
Etc.
Challenges in Testing Big Data:
1. The sheer volume of the data is a major challenge for testing.
2. Test environments and automation must be developed for different platforms.
3. No single tool can perform end-to-end testing.
4. A high degree of scripting is required to design test cases.
5. Automated Big Data testing procedures are predefined and not suited to handling unexpected errors.