Machine Learning & Big Data

Tuning Machine Language Models for Accuracy

Continuing our explanation of how to measure the accuracy of an ML model, here we discuss two metrics that you can use with classification models: accuracy and the area under the receiver operating characteristic curve (ROC AUC). These are some of the metrics suitable for classification models, such as logistic regression and neural networks. There are others …

Bias and Variance in Machine Learning

The risk in following ML models is that they could be based on false assumptions and skewed by noise and outliers. That could lead to bad predictions. That is why ML cannot be a black box. The user must understand the data and algorithms if the models are to be trusted. So here we look at some more measures of trustworthiness. As in the …

Mean Squared Error, R2, and Variance in Regression Analysis

Here we introduce some terms important to machine learning: variance, R2 score, and mean squared error. We illustrate these concepts using scikit-learn. It is important to understand these metrics to determine whether regression models are accurate or misleading. Following a flawed model is a bad idea. So it is important that you can …

Data Integrity vs Data Quality: What’s the Difference?

Big data has been widely labeled the new oil and the new black gold, parallels that describe the value of big data to our economy and business. However, the analogy only fits in limited situations. Big data becomes a truly valuable commodity only when the data is of high quality, as determined by a range of qualitative and quantitative …

Getting Started with scikit-learn

Here we explore another machine learning framework, scikit-learn, and show how to use matplotlib to draw graphs. The scikit-learn Python ML API predates Apache Spark and TensorFlow, which is to say it has been around longer than big data. It has long been used by those who see themselves as pure data scientists, as opposed to data …

How Malwarebytes uses big data and DevOps to keep millions of computers protected around the world

I’ve been around data (and now big data) for the last 20 years, working at companies like Apple, GoPro, Roku and Malwarebytes. And one thing I’ve learned is that we’re all on a big data journey. In my current role at Malwarebytes I lead the Data and Artificial Intelligence team. Malwarebytes is 100% focused on creating the best disinfection and …

Top 5 Machine Learning Algorithms for Beginners

Machine learning is a major component in the race towards artificial intelligence. Whether you’re seeking true artificial intelligence or simply trying to gain insight from all the data you’ve been collecting, machine learning is a major step forward. But where to get started? If you’re a beginner, machine learning can feel overwhelming – how to …

Introduction to Spark’s Machine Learning Pipeline

Here we explain what a Spark machine learning pipeline is. We will do this by converting existing code that we wrote, which runs in stages, to pipeline format. This will run all the data transformation and model fit operations under the pipeline mechanism. The existing Apache Spark ML code is explained in two blog posts: part one and part …

NLU vs NLP: What’s the Difference?

In the 21st century, computers can analyze all sorts of data, providing insights and performing tasks based on the learned outcome. When that data is language, however, it is a whole different world. Real-world language is more complicated for a computer to process, and difficult to mine efficiently for productive results. …