
Make the most of the 30-day free trial on Azure Services - Big Data Solutions with Azure Machine Learning

As a postgraduate student in Data Science, I was encouraged to earn a certificate from the Microsoft Professional Program as a way to stand out in a competitive job market. So I registered an account to use Azure services and decided to start with two courses in the Data Science track of the Microsoft Professional Program. This is the area where I lack experience most, since I have not had many chances to get familiar with big data tools such as Hadoop and Spark, let alone building machine learning solutions on cloud services like Azure and AWS. Though I did not get Verified Certificates for these courses on Edx.org (well, I am a poor student), the knowledge I gained is really useful, because most companies these days deploy their data models on cloud platforms such as Azure or AWS. Hence, it is necessary for a Data Scientist like me to become proficient with these tools.

The two courses that I enrolled in are:

In this article, I will share some highlights from the first course; the second one will be covered in the next article. Here is a list of the tools I used in the first course:

  • Azure Machine Learning Studio: Create predictive models from different data sources using Azure's built-in modules

  • Azure App Service: Create web service from an Azure Machine Learning Model

  • Azure Data Factory: Use Azure Machine Learning in Batch Processes

  • Azure Stream Analytics: Use Azure Machine Learning in Streaming Processes

  • Azure SQL Database: Store details of the data after batch processing, then import data from the Azure SQL database into an experiment or Jupyter Notebook

A more detailed description of Microsoft's Azure Machine Learning Studio:

It offers a robust set of tools designed to develop, deploy and manage machine learning projects. It incorporates open source Jupyter Notebooks, sample datasets and algorithms, and predesigned modules to aid in project development and management. The platform allows users to deploy applications and predictive models as a web service from Machine Learning Studio.

The platform also integrates with other Azure services, including Azure Data Factory, Azure Stream Analytics, Azure HDInsight, Azure Data Lake and Power BI. The environment supports both supervised and unsupervised learning, as well as Python, R scripts, and the open source frameworks Scikit-learn, TensorFlow, PyTorch, CNTK, and MXNet.

The platform also supports Docker containers. This produces an overall framework that is flexible and highly scalable. Machine Learning Studio provides comprehensive features across the full range of descriptive, diagnostic, predictive and prescriptive analytic types.

In this course, I built a classification model for news articles using natural language processing techniques in Azure Machine Learning Studio. The dataset I used was provided by one of the units I studied at university. It contains more than 130,000 articles that were crawled from online sources and categorized into 23 classes. Here is another example of models that I developed for this dataset using a Jupyter Notebook on Kaggle.

There are 4 main steps in this pipeline:

- Step 1: clean and preprocess data

- Step 2: extract numeric feature vectors from preprocessed text

- Step 3: train classification model

- Step 4: score and validate the model

Step 1: clean and preprocess data

We clean the text using the Preprocess Text module. The cleaning reduces the noise in the dataset, helps you find the most important features, and improves the accuracy of the final model. We remove stopwords - common words such as "the" or "a" - as well as numbers, special characters, duplicated characters, email addresses, and URLs. We also convert the text to lowercase, lemmatize the words, and detect sentence boundaries, which are then indicated by the "|||" symbol in the pre-processed text.

What if you want to use a custom list of stopwords? You can pass it in as an optional input. You can also use custom regular expressions (in C# syntax) to replace substrings, and remove words by part of speech: nouns, verbs, or adjectives.
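The Preprocess Text module does all of this through checkboxes, but if you would rather see the same ideas in code, here is a minimal Python sketch using NLTK. The regex patterns and the NLTK stopword list are my own stand-ins, not what the module uses internally:

    import re

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time setup: nltk.download("stopwords"); nltk.download("wordnet")
    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text: str) -> str:
        text = text.lower()                          # convert to lowercase
        text = re.sub(r"\S+@\S+", " ", text)         # remove email addresses
        text = re.sub(r"https?://\S+", " ", text)    # remove URLs
        text = re.sub(r"[^a-z\s]", " ", text)        # remove numbers and special characters
        text = re.sub(r"(.)\1{2,}", r"\1", text)     # collapse runs of duplicated characters
        tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOPWORDS]
        return " ".join(tokens)

    print(preprocess("Contact me@example.com or visit https://example.com NOW!!!"))
    # -> "contact visit"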

Step 2: extract numeric feature vectors from preprocessed text

To build a model for text data, you typically have to convert free-form text into numeric feature vectors. In this example, we use the Extract N-Gram Features from Text module to transform the text data into that format. This module takes a column of whitespace-separated words and computes a dictionary of the words, or N-grams of words, that appear in your dataset. It then counts how many times each word or N-gram appears in each record, and creates feature vectors from those counts. In this case, we set the N-gram size to 2, so our feature vectors include single words and combinations of two consecutive words.

We apply TF*IDF (Term Frequency Inverse Document Frequency) weighting to the N-gram counts. This approach adds weight to words that appear frequently in a single record but are rare across the entire dataset. Other options include binary, TF, and graph weighting.

Such text features often have high dimensionality. For example, if your corpus has 100,000 unique words, your feature space has 100,000 dimensions, or more if N-grams are used. The Extract N-Gram Features module gives you a set of options to reduce the dimensionality: you can exclude words that are too short or too long, or too uncommon or too frequent to have significant predictive value. In this tutorial, we exclude N-grams that appear in fewer than 5 records or in more than 80% of records.
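For readers who prefer notebooks, a rough scikit-learn equivalent of this configuration might look like the following sketch. Here "docs" is an assumed variable holding the preprocessed article texts from Step 1:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # docs is assumed to be a list of preprocessed article strings from Step 1.
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),   # single words plus two-word combinations (N-gram size = 2)
        min_df=5,             # exclude N-grams appearing in fewer than 5 records
        max_df=0.8,           # exclude N-grams appearing in more than 80% of records
    )
    X = vectorizer.fit_transform(docs)               # sparse TF*IDF matrix: (n_articles, n_features)
    vocabulary = vectorizer.get_feature_names_out()  # the learned N-gram dictionary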

You can also use feature selection to keep only the features that are most correlated with your prediction target. We use Chi-Squared feature selection to select 1,000 features. You can view the vocabulary of selected words or N-grams by clicking the right output of the Extract N-Gram Features module.
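The scikit-learn analogue of this step is SelectKBest with the chi-squared score. In this sketch, "y" is an assumed array holding each article's class label (the 23 categories):

    from sklearn.feature_selection import SelectKBest, chi2

    # y is assumed to hold the class label of each article (23 categories).
    selector = SelectKBest(chi2, k=1000)                 # keep the 1000 most label-correlated N-grams
    X_selected = selector.fit_transform(X, y)
    selected_terms = vocabulary[selector.get_support()]  # inspect which N-grams survived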

As an alternative to Extract N-Gram Features, you can use the Feature Hashing module. Note, though, that Feature Hashing has no built-in feature selection capabilities or TF*IDF weighting.
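In scikit-learn terms, the trade-off looks like this sketch: a HashingVectorizer maps N-grams into a fixed-size space without ever storing a vocabulary, which is exactly why there is nothing to feature-select over or weight by IDF (the n_features value here is just an illustrative choice):

    from sklearn.feature_extraction.text import HashingVectorizer

    # No vocabulary is stored: each N-gram is hashed straight to a column index,
    # so there is no dictionary to inspect, weight by IDF, or feature-select over.
    hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
    X_hashed = hasher.transform(docs)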

Step 3: train classification model

Now the text has been transformed into numeric feature columns. The dataset still contains string columns from the previous stages, so we use Select Columns in Dataset to exclude them.

We are solving a multi-class classification problem, but Azure does not offer a multi-class Support Vector Machine module. We therefore use Two-Class Support Vector Machine as the input to the One-vs-All Multiclass module to predict our target. At this point, the text analytics problem has been transformed into a regular classification problem, and you can use the tools available in Azure Machine Learning Studio to improve the model. For example, you can experiment with different classifiers to compare their accuracy, or tune hyperparameters to improve it.
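Outside of ML Studio, the same one-vs-all construction can be sketched in scikit-learn by wrapping a binary SVM, which trains one two-class classifier per category:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # One binary SVM per class, mirroring Two-Class SVM fed into One-vs-All Multiclass.
    model = OneVsRestClassifier(LinearSVC())
    model.fit(X_selected, y)   # X_selected and y come from the Step 2 sketches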

Step 4: score and validate the model

How would you validate the trained model? We score it against the test dataset and evaluate its accuracy. However, the model learned the vocabulary of N-grams and their weights from the training dataset, so we should reuse that vocabulary and those weights when extracting features from the test data, rather than creating the vocabulary anew. To do this, we add an Extract N-Gram Features module to the scoring branch of the experiment, connect the output vocabulary from the training branch, and set the vocabulary mode to read-only. We also disable the filtering of N-grams by frequency (by setting the minimum to 1 instance and the maximum to 100%) and turn off feature selection.

After the text column in the test data has been transformed into numeric feature columns, we exclude the string columns from the previous stages, just as in the training branch. We then use the Score Model module to make predictions and the Evaluate Model module to assess the accuracy.
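In scikit-learn, the read-only vocabulary idea corresponds to calling transform (never fit_transform) on the test data, so the training vocabulary, IDF weights, and selected features are reused as-is. Here "test_docs" and "y_test" are assumed held-out data:

    from sklearn.metrics import accuracy_score

    # Reuse what was learned on the training data; never refit on the test set.
    X_test = vectorizer.transform(test_docs)       # read-only vocabulary and IDF weights
    X_test_selected = selector.transform(X_test)   # same 1000 features as in training

    predictions = model.predict(X_test_selected)
    print("Accuracy:", accuracy_score(y_test, predictions))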

End note

Loosely speaking, it is more convenient to do data engineering in a Jupyter Notebook than in an experiment in ML Studio. On the other hand, the service's drag-and-drop interface lets you create a model without any coding experience, so both approaches have their pros and cons. Personally, I would do the data cleansing in a Jupyter Notebook on my local machine, then load the data into Azure ML to build the model, since Azure runs much faster than my local machine. That way I can significantly reduce the waiting time when developing a machine learning model.

This was my first time trying this service, and the performance really surprised me. When I ran a support vector machine on a 600MB dataset and generated cross-validation results for it locally, it took over 8 hours (I actually had to wait overnight for the process to finish). But when I tested the same dataset on the Azure service, it took less than 20 minutes to finish 5-fold cross validation for 3 different models: a support vector machine, a decision tree, and a default neural network. AMAZING!!!!
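For comparison, here is roughly what that benchmark looks like as notebook code: 5-fold cross validation over three models, with scikit-learn classes standing in for the ML Studio modules. The model choices and defaults below are my approximations, not the exact ML Studio configuration:

    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    models = {
        "support vector machine": OneVsRestClassifier(LinearSVC()),
        "decision tree": DecisionTreeClassifier(),
        "neural network": MLPClassifier(),   # library defaults, like ML Studio's default network
    }
    for name, clf in models.items():
        scores = cross_val_score(clf, X_selected, y, cv=5)   # 5-fold cross validation
        print(f"{name}: mean accuracy = {scores.mean():.3f}")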


