Introduction to Data Science: What is Big Data?
What Is Big Data
First, we will walk step by step through how big data evolved.
Evolution of Data
This section looks at how data evolved and how big data came about.
Today data is generated every day from many different sources: the evolution of technology, IoT (Internet of Things), social media such as Facebook, Instagram, Twitter, and YouTube, and many others.
1. Evolution of Technology
Let us see how technology has evolved. As the image below shows, in the earlier days we had landline phones, but now we have smartphones running Android, iOS, and HongMeng OS (Huawei) that make our lives, as well as our phones, smarter.
Apart from that, we used bulky desktop computers to process megabytes of data stored on floppy disks; you may remember how little data a floppy could hold. Then hard disks were introduced, which can store terabytes of data. Now, thanks to modern technology, we can store data in the cloud as well.
Similarly, self-driving cars have now arrived. You may be wondering why we are pointing this out: as technology advances, we generate more and more data. Take your phone as an example. Have you ever noticed how much data your smartphone generates with every action? Even a single video sent through WhatsApp or any other messenger app generates data. This is just one example; you have no idea how much data you generate with everything you do. Much of this data is not in a format that relational databases can handle, and on top of that the volume of data has grown exponentially.
Now consider self-driving cars. Such a car has sensors that record every minor detail, such as the size of an obstacle, the distance to it, and much more, and the car then decides how to respond. You can imagine how much data is generated for each kilometer driven. Let's move on to the next factor in the evolution of data.
Figure: Evolution of Technology.
2. IoT (Internet of Things)
You have probably heard of IoT. The self-driving car discussed earlier is itself an example of IoT. So what exactly is it? IoT connects physical devices to the internet and makes them smarter. Nowadays we see smart ACs, smart TVs, and so on. Take a smart air conditioner as an example: it monitors your body temperature and the outside temperature, and sets the room temperature accordingly.
To do this, it first has to accumulate data, both from the internet and from sensors that monitor your body temperature and your surroundings. In other words, it fetches data from various sources and decides on that basis what the temperature of your room should be. So IoT generates a huge amount of data. As the image below shows, there are already a lot of IoT devices, and it is projected that by 2020 there will be 50 billion of them. We will not discuss here how IoT leads to such a huge number of smart devices. Let's move on to another factor that generates big data.
3. Social Media
Social media is one of the most important factors in the evolution of big data. Nowadays almost everyone uses Facebook, Instagram, YouTube, Twitter, and many other social media websites, and these sites hold a great deal of data. For example, they store our personal details such as name and age, and every picture we like, react to, or comment on also generates data. Even the Facebook pages we like generate data. Many people also share videos on Facebook, which generates a huge amount of data. The most challenging part is that this data is not structured and is huge in size at the same time. Data is not only generated in huge amounts but also in different formats. For example, videos are unstructured data, and the same goes for images. So there are millions of ways in which data is generated nowadays, and all of them contribute to big data.
4. Other Factors
Most of us visit websites like Amazon, Flipkart, etc. Suppose we want to buy a t-shirt or a pair of jeans. We search through a lot of t-shirts or jeans, and our search history is stored somewhere. When we buy something for the first time, our purchase history is stored as well, along with our personal details. There are numerous ways in which we generate data without even knowing it. Amazon did not exist earlier, so at that time there was no way such a huge amount of data could be generated. Similarly, data is growing in other sectors as well, such as Banking & Finance, Media & Entertainment, Healthcare, and Transportation.
So now the main question is what exactly big data is, and how we decide that some data counts as big data.
What is Big Data
Now look at a proper definition: big data "is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database system tools or traditional database applications".
Does this mean that our traditional, older systems can process this data?
No, there is too much data to process. When traditional systems were invented, we never anticipated that we would have to deal with such an enormous amount of data.
So how do we classify some data as big data? For that we have the 5 V's of big data.
5 V's of Big Data
Some people write about only 3 V's, but here we will discuss 5 V's. The discussion below explains how data becomes big data through these five characteristics.
The first V of big data is Volume: the amount of data is tremendously large. As the diagram shows, the volume of data is increasing exponentially. We were dealing with 4.4 zettabytes of data in 2017, and this is expected to increase to 44 zettabytes in 2020, which is equal to 44 trillion gigabytes. That is a really huge amount of data.
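The conversion from zettabytes to gigabytes can be checked with a few lines of Python. This is only a small sketch: it assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one zettabyte is 10^21 bytes, and the function name is made up for this example.

```python
# Decimal (SI) units are assumed here: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21
TRILLION = 10**12


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes (hypothetical helper)."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


gigabytes = zettabytes_to_gigabytes(44)
print(f"44 ZB = {gigabytes:,} GB")
print(f"That is {gigabytes // TRILLION} trillion GB")
```

So one zettabyte is one trillion gigabytes, and 44 zettabytes is 44 trillion gigabytes, as stated above.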
All of this humongous data comes from multiple sources, which brings us to the second V, Variety. We deal with many different kinds of files all at once: MP3 files, videos, JSON, CSV, TSV, and many more. This data is structured, unstructured, and semi-structured all together. Let us explain with the diagram below. We have audio files, videos, PNG images, JSON, log files, emails, and various other formats of data. This data is classified into three forms.
I. Structured Format
In the structured format, our data has a proper schema: we know in advance which columns will be there. Structured data is therefore data in tabular form.
II. Semi-Structured Format
The second is the semi-structured format. As the diagram shows, this covers JSON, XML, CSV, TSV, and email, where the schema is not properly defined.
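To see what "schema not properly defined" can look like in practice, here is a small Python sketch using the standard json module. The two records are invented for illustration; the point is only that each record carries its own set of fields, so no fixed list of columns applies to all of them.

```python
import json

# Two hypothetical JSON records, invented for this example.
raw_records = [
    '{"name": "Asha", "age": 30}',
    '{"name": "Ben", "email": "ben@example.com", "tags": ["new"]}',
]

records = [json.loads(line) for line in raw_records]

# Every field that appears in at least one record.
all_fields = sorted(set().union(*(record.keys() for record in records)))

# Only the fields that appear in every record.
common_fields = sorted(set.intersection(*(set(record) for record in records)))

print("All fields:   ", all_fields)
print("Common fields:", common_fields)
```

In a structured table both rows would have exactly the same columns. Here only one field is shared by both records, and the rest vary from record to record.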
III. Unstructured Format
In the unstructured format we have log files, audio files, video files, and all types of image files.
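The classification above can be sketched as a simple lookup in Python. Treat this as a rough illustration only: the file extensions chosen for each category and the function name are assumptions made for this example, and structured data is left out because it usually lives in database tables rather than in files with a telltale extension.

```python
from pathlib import Path

# Extension lists follow the article's classification; the exact
# extensions picked for each format are illustrative assumptions.
SEMI_STRUCTURED = {".json", ".xml", ".csv", ".tsv", ".eml"}
UNSTRUCTURED = {".log", ".mp3", ".mp4", ".png", ".jpg"}


def classify_file(filename: str) -> str:
    """Return the article's category for a file name (hypothetical helper)."""
    extension = Path(filename).suffix.lower()
    if extension in SEMI_STRUCTURED:
        return "semi-structured"
    if extension in UNSTRUCTURED:
        return "unstructured"
    return "unknown"


for name in ["orders.json", "holiday.MP4", "server.log", "notes"]:
    print(f"{name}: {classify_file(name)}")
```

A real system would look at the content of the data and not only at the file name, but the lookup shows how the same pile of files splits into different forms.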
The speed at which this variety of data accumulates brings us to the third V, Velocity. Let us explain with the diagram. Earlier we used mainframe systems: huge computers, but with little data, because few people worked with computers at that time. Then computing evolved to the client-server model, and after that came web applications and the internet boom. Day by day the number of web applications on the internet grew, and now everyone uses these applications from computers as well as from mobile devices. More users, more appliances, more apps, and more mobile devices mean a lot more data.
When we talk about people generating data, the first thing that comes to mind is social media. Just think how much data Instagram alone generates from your posts and stories.
Now consider all of these applications together. As the diagram below shows, every 60 seconds there are about 100 thousand tweets on Twitter, 695,000 status updates on Facebook, 11 million Instagram messages, 698,445 Google searches, and 168 million emails sent, which adds up to almost 1,820 terabytes of data generated every minute. The number of mobile users is growing as well: 217 new mobile users are added every minute. That is a lot of data to process and to arrange in a proper manner, and so it becomes big data.
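To get a feeling for what these per-minute figures mean over a whole day, we can scale them up in Python. This is a back-of-the-envelope sketch: it takes some of the numbers quoted above as given and assumes, for simplicity, that the rate stays constant for all 1,440 minutes of the day. The dictionary keys and the function name are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60

# Per-minute figures as quoted in the text above.
per_minute = {
    "facebook_status_updates": 695_000,
    "google_searches": 698_445,
    "emails_sent": 168_000_000,
    "terabytes_generated": 1_820,
    "new_mobile_users": 217,
}


def scale_to_day(rates: dict) -> dict:
    """Scale per-minute counts to per-day counts, assuming a constant rate."""
    return {name: count * MINUTES_PER_DAY for name, count in rates.items()}


per_day = scale_to_day(per_minute)
for name, count in per_day.items():
    print(f"{name}: {count:,} per day")
```

Under that assumption, 1,820 terabytes per minute comes to more than 2.6 million terabytes per day, which shows why the speed of data generation is a characteristic of its own.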
Now the bigger problem is extracting useful data, which brings us to the next V, Value. First we need to mine the useful content from our data, that is, make sure the dataset has some useful fields. After that we clean the data and perform analytics on it. The analysis should yield something of value: insights that were not possible earlier and that help the business grow. Whatever big data has been generated only makes sense if it has some value and helps us grow our business.
Getting value from the data is a big challenge, and that brings us to the next V, Veracity.
Big data comes with a lot of uncertainty and inconsistency. When we dump such a huge amount of data, some data packets are bound to be lost in processing. We therefore need to fill in the missing data, then mine it again, process it, and only then come up with good insights. In the diagram below, some of the values are missing, some are very small, and some are very large.
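Filling in missing data can be sketched in a few lines of Python. This is only one simple way to do it: missing entries are replaced with the median of the values that are present. That strategy is an assumption chosen for illustration, not something the discussion above prescribes, and the readings and function name are invented for the example.

```python
from statistics import median


def fill_missing(values: list) -> list:
    """Replace None entries with the median of the known values (hypothetical helper)."""
    known = [value for value in values if value is not None]
    if not known:
        raise ValueError("cannot fill missing values: no known values")
    fill_value = median(known)
    return [fill_value if value is None else value for value in values]


# Hypothetical sensor readings with two lost values.
readings = [21.5, None, 22.0, None, 23.5]
cleaned = fill_missing(readings)
print(cleaned)
```

The median is used here because it is less affected by the very small and very large values mentioned above than the average would be.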
Big data brings a lot of problems as well as a lot of opportunities, which we will discuss in the next article.