The last time we attended a Big Data meetup, we learnt about real-time analytics and machine learning. On the 15th of November 2017, we found ourselves once more at the Virtusa auditorium at Orion City for another edition of the Colombo Big Data meetup.
We had Dinesh Asanka up first
The first topic of the day, or rather the evening, at the Colombo Big Data Meetup was linguistic analysis in data warehousing. The session was conducted by Dinesh Asanka, a visiting lecturer at SLIIT, who spoke about why linguistic analysis is needed. He started off by defining what linguistics is.
According to Dinesh Asanka, linguistics is the scientific or systematic study of language. We essentially use language to communicate our ideas to someone else. We use various terms to enhance the base value of somebody or something. Languages follow different rules: some are historical, some are scientific.
He then spoke about typical linguistic terms such as low, medium, and high, which are used to describe values in a given dataset. These can be enhanced by adding terms such as not low, very low, low medium, high medium, etc. You can even combine them to create terms such as “low and not high”.
What is a data warehouse?
Dinesh’s next topic at the Colombo Big Data Meetup was the data warehouse. A data warehouse is an enterprise-wide framework that you can use to build your own customized analytics platform. It is not only a place to store all your data; it is also a comprehensive set of technologies. You take data from sources, filter that data, and then load it into your data warehouse. From there, you can carry out numerous kinds of data analytics.
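The take, filter, and load flow described above can be sketched in a few lines of Python. This is only an illustration and was not shown at the meetup: the sample rows, the function names extract, keep_valid, and load, and the in-memory SQLite database standing in for a warehouse are all assumptions.

```python
import sqlite3

# Hypothetical source rows; a real pipeline would read from files, APIs, or databases.
SOURCE_ROWS = [
    {"region": "Colombo", "amount": 1200.0},
    {"region": "Kandy", "amount": None},  # incomplete record
    {"region": "Galle", "amount": 850.5},
]


def extract():
    """Pull raw rows from the (hypothetical) source."""
    return list(SOURCE_ROWS)


def keep_valid(rows):
    """Filter out rows that have no amount."""
    return [row for row in rows if row["amount"] is not None]


def load(rows, conn):
    """Load the filtered rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(keep_valid(extract()), conn)

# Once loaded, the data can be analyzed with ordinary queries.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count, total)
```

Keeping each stage in its own function means the filter rules can change without touching the code that reads the sources or writes to the warehouse.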
If you’re analyzing sales, there’s no set path to analyze it. That’s where the framework comes into play. It gives you a choice of parameters to use. In modern terms, data can be anything. Be it a tweet, a Facebook status, a video file, an audio file, almost anything can be perceived as data.
There are limitations in a traditional data warehouse. When you want to analyze something, you must first label it. For example, if you want to find the age groups in a given dataset, you would assign each record a label. This is called the bucketized method. With these analytics, you can generate a report.
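As a rough sketch of the bucketized method, the Python below assigns each age exactly one label and counts the labels for a report. The band names and boundaries are hypothetical and were not taken from the talk.

```python
from collections import Counter

# Hypothetical age bands: (label, inclusive lower bound, exclusive upper bound).
AGE_BUCKETS = [
    ("child", 0, 13),
    ("teen", 13, 20),
    ("adult", 20, 60),
    ("senior", 60, 200),
]


def bucketize(age):
    """Return the single label whose range contains the given age."""
    for label, low, high in AGE_BUCKETS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")


ages = [8, 15, 19, 20, 34, 59, 60, 72]
report = Counter(bucketize(age) for age in ages)
print(dict(report))
```

Note that 19 and 20 receive different labels even though they are only a year apart. Each record gets exactly one label, with nothing to indicate how close it sits to a boundary.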
Asanka’s next topic was fuzzy theory. In mathematics, fuzzy sets are sets whose elements have degrees of membership. Essentially, every object carries a weightage rather than simply being in or out of a set. The idea is to bring some concepts from fuzzy theory into data warehousing. Using an example, Asanka also spoke about fuzzy membership and how data has a weightage and not just a label.
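To make degrees of membership concrete, here is a minimal sketch using a triangular membership function, one common shape among many. The fuzzy sets young and middle_aged and their breakpoints are assumptions for illustration, not values from the session.

```python
def triangular(x, a, b, c):
    """Membership degree for a triangular fuzzy set.

    The degree is 0 at or outside the feet a and c, rises linearly to 1
    at the peak b, and falls linearly back to 0. Requires a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over age; the breakpoints are illustrative only.
def young(age):
    return triangular(age, 0, 20, 40)


def middle_aged(age):
    return triangular(age, 30, 45, 60)


# A 35-year-old belongs partly to both sets instead of receiving one label.
age = 35
print(young(age), middle_aged(age))
```

Here a single value carries a weight in each set, which is the contrast with a hard label that the talk drew.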
Asanka also touched on various membership functions in fuzzy theory. But where does linguistics come in? In fuzzy theory, there are a number of operations, such as concentration, fuzzification, and simple operations, all of which Asanka spoke about.
So in essence, if you used a traditional warehouse, you would only get descriptions covering a small range of labels, but with linguistic analysis, you would have a number of states to produce a more detailed report with deeper analysis. The effort is to bring in linguistic variables that give you a greater understanding of your data so that you can make more informed decisions.
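One widely used convention, which goes back to Zadeh's work on fuzzy sets, maps linguistic modifiers onto operations on membership degrees: "very" is concentration (squaring the degree), "not" is the complement, and "and" is the minimum. Whether the session used exactly these definitions is an assumption, and the low and high membership functions below are hypothetical.

```python
def low(x):
    """Hypothetical membership in 'low' for a score x on a 0-100 scale."""
    return max(0.0, min(1.0, (50 - x) / 50))


def high(x):
    """Hypothetical membership in 'high' for a score x on a 0-100 scale."""
    return max(0.0, min(1.0, (x - 50) / 50))


def very(mu):
    """Concentration: squaring makes the term stricter."""
    return mu ** 2


def not_(mu):
    """Complement of a membership degree."""
    return 1.0 - mu


def and_(mu_a, mu_b):
    """Fuzzy intersection using the minimum."""
    return min(mu_a, mu_b)


x = 25
print("low:", low(x))
print("very low:", very(low(x)))
print("not low:", not_(low(x)))
print("low and not high:", and_(low(x), not_(high(x))))
```

Because each modifier is just a function on a degree, terms such as "very low" or "low and not high" can be composed freely from a small set of base terms.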
Next up was Dinesh Priyankara
In case you don’t know what Azure HDInsight is, it is a product of Microsoft that deals with Big Data analytics. In essence, Azure HDInsight is a managed Open Source Big Data analytics service for the enterprise. Priyankara spoke about Hadoop clusters and how to create a Hadoop cluster within a matter of minutes using Azure HDInsight.
So first of all, what is Hadoop?
Well, Priyankara explained it as a scalable, fault-tolerant, open source programming framework for distributed processing and distributed storage of large datasets on clusters of commodity hardware. Hadoop grew out of ideas published by Google, and much of its development took place at Yahoo. Priyankara also spoke about why we would need to carry out instructions across multiple machines or environments.
Getting into slightly more technical terms, Priyankara explained that a Hadoop cluster is configured with a minimum of two servers: a name node and at least one worker node. You can add as many worker nodes as you want to distribute the workload. He then spoke of the three components in a Hadoop system. These are HDFS (Hadoop Distributed File System), MapReduce, and YARN (Yet Another Resource Negotiator).
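To give a feel for the MapReduce component, here is a single-machine sketch of the map, shuffle, and reduce steps applied to a word count. It does not use any Hadoop API; the function names and sample lines are assumptions for illustration. In a real cluster the map and reduce tasks would run in parallel on worker nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on Hadoop each line could live on a different node.
lines = ["big data meetup", "big data analytics", "data warehouse"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call only sees one line and each reduce call only sees one key, the work can be split across as many worker nodes as are available.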
What is Azure HDInsight?
Priyankara then went on to speak about his topic: Azure HDInsight. This is essentially Apache Hadoop running on Microsoft Azure. The key advantage is that a Hadoop system can be set up within 15-20 minutes.
Azure HDInsight is fully managed and completely open source. Priyankara also spoke about Hortonworks, a big data software company that develops and supports Apache Hadoop for the distributed processing of large datasets across computer clusters. If a Hadoop update appears, you don’t need to worry. Hortonworks will update everything accordingly, and all the latest features of Hadoop will be made available on Azure HDInsight.
Priyankara then went into detail about how a Hadoop cluster is set up and the process involved. During this, he also spoke about Azure HDInsight cluster types such as Apache Hadoop, Apache Spark, Apache Storm, Apache HBase, Interactive Hive, Microsoft R Server, etc. Priyankara also spoke about some vital settings such as name and subscription, cluster type, OS, HDInsight version, cluster tier, resource group, etc.
Another Colombo Big Data Meetup came to an end
With a few more examples and demonstrations, Dinesh Priyankara’s session came to an end. In closing, he emphasized that the community is still growing, and he encouraged those attending to feel free to sign up to give a presentation at the next meetup. And thus, the latest edition of the Colombo Big Data meetup came to an end.