Data Visualization Guide – BMC Software | Blogs
The Data Visualization Beginner’s Guide https://s7280.pcdn.co/data-visualization/ Mon, 29 Nov 2021 00:00:41 +0000

“The medium is the message.”

Data visualization is the graphical representation of data. It is a way to communicate the overall meaning of data points in a simple and meaningful way. Where a picture is worth a thousand words, a data visualization is worth a thousand data points.

In a world of big data, where subjects such as business operations, the weather, traffic, and customer acquisition can each generate hundreds or thousands of data points, your message has more impact if you use a graph rather than simply displaying an Excel spreadsheet.

The data points can be presented in multiple forms to give a different message. It takes some discipline to understand the message you wish to convey in order to organize the data points into the proper visualization.

(This article is part of our Data Visualization Guide. Use the right-hand menu to navigate.)

Data Visualization is an everyday skill

Creating a data visualization is an important career skill as more businesses are making data-driven decisions. Everyone is a data scientist. Whether you are a data scientist filtering through a bunch of data to find out what makes it unique or you are on a sales team, attempting to understand your target customer, you will likely use some form of data visualization to explain what is happening.

Fortunately, making a visualization is very accessible. Because it is an everyday skill, the visualization tools and articles explaining how to use them are abundant—for all data sources, and most common kinds of data types. The most common types of visualizations are:

Each of these can be explored for more specific representations. The best ways to improve your skills are to explore the infographics that cross your path, read through the data charts you encounter, and practice creating some of your own at your job.

(Get inspired with this gallery of beautiful data visualizations. Check out Nadieh Bremer’s portfolio at Visual Cinnamon.)

The importance of visualization

Data visualization is essential for communicating what is happening with one’s data. In fact, consumers now expect a service to offer some form of visualization tool so that they can:

  • Interact with the platform
  • Gain insight into what is happening with their activity

Whatever platform you might be using will generally have a data visualization element to it. Amazon, Azure, and Google all have their built-in data visualization tools to monitor all kinds of usage metrics within their platforms.

Companies like Salesforce, web hosting services, and social media companies have also built visualization tools for their content creators over the past few years. These tools give creators an easy way to understand what is happening between themselves and their audiences, so they can better craft their message.

Top data visualization tools

Whether working individually or as part of a business, here are the top visualization tools available today:

  • Tableau
  • Grafana
  • Google Charts
  • d3.js

Data visualization best practices

When looking at data and determining how to visualize it, here are some key questions to ask yourself.

Good questions to ask

  • How can I represent the information before me in a graphical representation?
  • Can I say it all in one representation or do I need a few?
  • Is the information time-sensitive?
  • Does it represent growth of something?
  • Does it represent frequency of occurrences?

Does it need to represent:

  • Dependence on time
  • Growth
  • Frequency of occurrence
  • Dependence on location

Does your audience prefer beautiful or functional charts?

Craigslist’s design has stayed much the same for a couple of decades, and its success is in large part due to its functionality. It is a boring design, sure, but one that gets information to people by staying out of the way.

Charting is similar.

Some of the visualizations the big platforms use are very basic because they only have to communicate something simple. Dashboards that users come back to every single day often require a little more design consideration, because their users want to work with something that is easy on the eyes.

What message does the audience want?

The best visualizations are often the simplest ones that serve one purpose:

  • User activity over time is displayed with a heatmap or a line chart.
  • Sales volumes are displayed with cumulative or daily line charts.
  • User activity by region is displayed on a geographical map with a color-based activity overlay.

But some users require visualizations that display information at a high level. They need to be able to dig deeper into it. Dashboards are particularly handy for:

  • Displaying lots of data
  • Allowing the user to dig deep when they see something on the chart that strikes them as interesting

Get involved with a community

“Imitation is the sincerest form of flattery.”

Messages are bits of communication from one person to another. The information takes a form and takes on a particular style. If you wish to be an effective communicator when crafting your visualization, become familiar with how other people are already crafting their messages.

Join communities of people who are sharing visualizations with one another. Look outside that community and observe how other people are talking with one another. Like any type of communication, getting better at representing your data requires:

  • Practice
  • Mimicry
  • Exposure to what options are out there

(Gain exposure with Data Is Beautiful & these blogs.)

Seek feedback

Finally, visualizations rely on communication, so it is a good idea to get feedback from your audience. Pay attention to whether your audience is asking a lot of questions when you share your visualizations. Things as simple as decreasing line widths and using a bold font can improve your visualization’s effectiveness. If it’s not working, change it.

Good luck getting going!

Introduction To Graph Databases https://www.bmc.com/blogs/graph-databases/ Wed, 28 Oct 2020 12:25:53 +0000

Data rules the tech world. Data, data, data. There is data that needs to be seen by a user, data to be reviewed by a data scientist. Data for investors. Data for management. Data. For any organization, how you structure and store your data informs your potential success.

The company storing its data in manually written Microsoft Word documents is a step behind the company storing its data in easy-to-read Excel charts on a cloud server. Innovation in the database world continues, and graph databases are one of its more recent steps.

How traditional databases work

Data has traditionally been stored in tables. For the coffee shop industry, you would probably see a set of tables like this:

Coffee Shop

Name
------
NOVO
Steam
Thump
Corvus

Customers

CustomerID  Name    BDay      Email  Phone
----------  ------  --------  -----  -----
123461      Jack    1-1-1970  a@b    1
41263       Jim     1-1-1970  b@c    2
45714       Jill    1-1-1970  c@d    3
71129       Jerome  1-1-1970  d@e    4
90123       Jeb     1-1-1970  e@f    5

Then, there are separate tables to store all the orders, employees, inventory, and suppliers. That gives us six tables to store the data of the coffee shop industry. In a traditional relational database, each new relationship you want to record typically requires a new table to store that information.

While tables work well for things like inventory items, they are limited when it comes to expressing relationships between entities.

In traditional databases, answering relationship questions requires IDs to be double-entered across tables, then a number of join statements to say something like, “Return all values where ID-1 matches other ID-1’s in all these sets of tables. Then, look at other common IDs that appear in the tables with ID-1, and also find those IDs across all sets of tables.”
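
As a rough illustration of that kind of join, here is a minimal sketch using Python's built-in sqlite3 module. The visits table and all column names are assumptions invented for this example; only the customer IDs, customer names, and shop names are borrowed from the tables above.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shops     (shop_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- The relationship gets its own table that repeats IDs from both sides.
    CREATE TABLE visits    (customer_id INTEGER, shop_id INTEGER);
""")
conn.executemany("INSERT INTO shops VALUES (?, ?)",
                 [(1, "NOVO"), (2, "Steam")])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(123461, "Jack"), (41263, "Jim")])
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(123461, 1), (41263, 1), (123461, 2)])

# "Which customers visit NOVO?" needs two joins through the visits table.
rows = conn.execute("""
    SELECT c.name
    FROM customers AS c
    JOIN visits AS v ON v.customer_id = c.customer_id
    JOIN shops  AS s ON s.shop_id = v.shop_id
    WHERE s.name = 'NOVO'
    ORDER BY c.name
""").fetchall()

novo_customers = [name for (name,) in rows]
print(novo_customers)
```

Every additional kind of relationship (friends, suppliers, favorite drinks) would mean another linking table and another join in the query.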

The Cornell Movie-Dialogs Corpus is organized like a traditional database. To know who said what in which movie, you have to cross-reference IDs with each movie line. Only then can you see, for example, that the line “Here’s looking at you, kid” was spoken by character ID 546 (Humphrey Bogart’s character) in movie ID 123 (Casablanca).

You can answer relational questions with traditional SQL and NoSQL databases, but only through a well-thought-out query that states explicitly which sets of tables to explore and join, which is a tremendous amount of work. Then you wait a while for the query to return.

Queries about relationships in non-graph databases don’t scale well. The relationship questions in relational databases can take minutes, where the same query in a graph database can take seconds.

What is a graph database?

As society becomes increasingly interconnected, those connections need to be expressed. Social media feeds show who is friends with whom, recommendation lists show that people who like these songs or videos also like those, and security tools identify the rogue member of a group who is most likely a threat. Graph databases help store this kind of information and query it with ease.

Graph databases are used to understand highly interconnected data. They are great at exploring the relationships between data. By design, graph databases can easily answer these types of questions:

  • Do the user’s friends like the same music as the user?
  • Can this fraudulent user be detected because they have fewer relationships to other members of the group?
  • If this person knows these types of things, what are the chances they will connect the dots to learn these other types of things?

Use cases

Graph databases are typically schema-less and mutable. They are particularly good to use when you need to:

  • Explore the relationships between data
  • Easily scale queries to increasing amounts of relationships

Example use cases for graph databases include:

  • Fraud detection
  • Real-time recommendations
  • Data management
  • Identity management
  • Network and IT operations

The three graph components

The graph is a mathematical concept that originates in graph theory. The graphs of graph theory can be thought of as trees, networks, webs, or mind-maps. Pick your poison. They all consist of the same parts:

  • Node
  • Edge
  • Graph

Node

The basic element of the graph is the node. Nodes are the points on the graph:

  • One node contains data like the customer name and the customer coffee choice.
  • Another node on the same graph might have the name of a coffee shop, its address, and its hours of operation.

Edge

An edge on the graph defines the relationship between two entities. For two customer nodes, for example, we can say they both “attend” the coffee shop. The nomenclature will vary based on your organization and needs. For example, “attends” can be replaced with a label built on customer, shopper, patron, user, etc.

Graph

Finally, we have our graph of nodes and edges. From it we can express relationships such as “Customer 1 and Customer 2 both attend Coffee Shop 1” and “Only Customer 1 attends Coffee Shop 2”.

We could add many more relationships to this graph. Within this chart we could:

  • Show the suppliers to each coffee shop.
  • Add family and friends of these two customers, creating edges such as “family”, “brother”, or “friend” in the network, show what their coffee preferences are, and then use the “attends” edge to express which shops they attend.

These graphs can get incredibly complex.
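
To make the node, edge, and graph ideas concrete, here is a minimal sketch of the coffee shop graph in plain Python. It is not how a graph database stores data internally, and the identifiers (customer1, shop1, who_attends) are hypothetical names chosen for this illustration.

```python
# Nodes: each has an ID and a dictionary of properties (values are illustrative).
nodes = {
    "customer1": {"type": "customer", "name": "Customer 1"},
    "customer2": {"type": "customer", "name": "Customer 2"},
    "shop1": {"type": "shop", "name": "Coffee Shop 1"},
    "shop2": {"type": "shop", "name": "Coffee Shop 2"},
}

# Edges: (source node, relationship label, target node).
edges = [
    ("customer1", "attends", "shop1"),
    ("customer2", "attends", "shop1"),
    ("customer1", "attends", "shop2"),
]


def who_attends(shop_id):
    """Return the names of customers with an "attends" edge to shop_id."""
    return sorted(
        nodes[src]["name"]
        for src, label, dst in edges
        if label == "attends" and dst == shop_id
    )


print(who_attends("shop1"))
print(who_attends("shop2"))
```

Adding a supplier or a “friend” relationship here means appending one more tuple to the edge list rather than designing a new table, which is the flexibility graph databases build on.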

Popular graph tools from cloud providers

Neo4j is one of the biggest players in the graph world, and they are creating a Kubernetes infrastructure that can implement the graph database on any cloud service. This means you can implement graph databases on AWS, Azure, and GCP.

While Neo4j takes some extra work to get up and running, AWS offers Amazon Neptune, a ready-to-go graph database service on AWS. Another good graph alternative is Cayley, which is open source and usable on GCP.

What other good graph options have you come across? Please reach out and let us know.

Enabling the Citizen Data Scientists https://www.bmc.com/blogs/citizen-data-scientist/ Mon, 01 Jun 2020 00:00:54 +0000

More businesses are investing in statistics and analytics, creating a need for people who excel at working with large volumes of high-velocity data. Data science is a lucrative profession that generally requires many years of schooling, often culminating in a PhD.

However, the citizen data scientist is an emerging role that benefits companies in need of mid-level employees who have knowledge of data principles and analytics. In this article, we’ll talk about what a citizen data scientist is, how businesses benefit from the role, and how to become one.

What is a citizen data scientist?

A citizen data scientist is someone who works with Big Data to create models and analytics that are more advanced than what a layperson could produce, but who does not operate at the level of a data scientist. The term “citizen” implies someone without formal training in data science, and the “data scientist” part of the title acknowledges they have the practical knowledge to be successful.

Citizen data scientists don’t replace the need for data scientists who go through the training and education to have the skills to analyze large quantities of data, create advanced models, deduce important statistics and create business metrics using insights from data. Without formal training, citizen data scientists have the skill to create moderately complex models, but usually lack what’s necessary to work with advanced data analytics and modeling.

Still, becoming a citizen data scientist is an increasingly popular choice for people who aren’t ready to dive into a PhD but are skilled with data, because there is a greater need for talented data professionals than there are data scientists to fill the roles. This makes for an especially lucrative profession for the right person who is ready to take their mid-tier data modeling and analytics skills to the next level of employment.

Who can become a citizen data scientist?

To become a citizen data scientist you need to have the following skills and traits:

  • Organizational context: The right person is someone who understands the vision, mission, and needs of the company, and how data helps meet those needs.
  • Divergent thinking: The ideal citizen data scientist can think outside the box, coming up with data models and connections that go beyond what the average layperson would conceptualize.
  • Strong analytical skills: A citizen data scientist must be analytical as a hallmark of the role. Being able to perform fairly complex data analysis is part of the job.
  • Ability to assess information meaningfully: It’s important for a citizen data scientist not only to assess the data in front of them logically but also to draw meaningful conclusions from it that the average person might not see.
  • Emphasize business value: In data analysis, a citizen data scientist must be able to emphasize the value in what they are doing in order to develop into the position as a promotion from their existing duties.
  • Industry adjacency: The best candidates for a citizen data scientist role work in a field that is adjacent to data science, something with lots of math and analytical processes. Software developers and engineers might be good candidates for the role.

The role of data science in enterprise business

In enterprise business, data science provides important insight into the variables that are at the core of a company’s business model. Things like customer traits and activities, sales growth, resource consumption, and employee retention can all be modeled using data science.

Data models are at the heart of data science. These visual components offer easily digestible information about how points of data relate to one another, and the goals of the company. These models determine how data is structured and compiled in databases and other warehousing and analytics tools that offer deep insights and make them usable. They give context to important data.

Creating a data model requires an understanding of programming and data relationships. In recent history, it’s a role that’s been assumed predominantly by data scientists. But as the need for consistent, quality data, analytics, and insights continues to dominate digital businesses, some people in adjacent roles have been able to propel themselves into a new career: citizen data scientist.

While a scarcity of data scientists has encouraged business leaders to consider their options, technology is driving the democratization of data, modeling, and analytics. This is an exciting time for businesses that need to understand the important insights Big Data has to offer. With platform technology that makes modeling easier, the right candidate for the role can expand their impact on the organization while leaving room for a data scientist to create advanced models that might be necessary for growth.

How to become a citizen data scientist

The fastest path to becoming a citizen data scientist is to reskill or upskill with your employer.

For the sake of effectiveness, it’s a good practice to work in an adjacent field. A career like backend software development or engineering can be a good fit because the roles require a comprehensive understanding of math, computer science, relationships, and coding. Moving into a citizen data scientist role from one of the above-mentioned ones only requires you to upskill, rather than gaining a whole new skillset. Some people who work in these fields are para-professional data modelers already and only need to be able to show and prove their existing skills to move into the role.

If you’re not in an adjacent role, you can opt to take certification courses that offer the skills you need to be an effective citizen data scientist. This might include certifying in Tableau or Python or taking courses that are specific to data science to get a basic foundation. If you have a strong understanding of the company vision and need for data, you may be able to work with your company to reskill and obtain the right certification.

Either path requires extensive knowledge of data software, and the ability to use it effectively. In a digital business economy, more professionals understand the need to diversify their software skills in order to create new and exciting roles that benefit both themselves and their companies.

How to Draw 3D Charts with Matplotlib https://www.bmc.com/blogs/matplotlib-3d-charts/ Thu, 02 Jan 2020 00:00:36 +0000

In this article, I’ll show how to draw three-dimensional charts in Matplotlib.

To plot charts in Matplotlib, you need to use a Zeppelin or Jupyter notebook (or another graphical environment). Your other option is to save your charts to a graphics file in order to display them later. This is particularly useful if you’re executing a long-running program, such as training a machine learning model, that takes too long to run in an interactive notebook.
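
For the save-to-file option, a minimal sketch might look like the following. The Agg backend choice, the sample data, and the output file name are assumptions made for this example; they are not taken from the article.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# A tiny made-up data set, just so there is something to draw.
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], "bo")

# The file name and location are arbitrary choices for this sketch.
out_path = os.path.join(tempfile.gettempdir(), "chart_example.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```

The saved image can then be opened later in any image viewer, independent of the program that produced it.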

Introduction to 2D charts

I’ll start with a very easy explanation of the basic concepts. The flat surface of a chart is also known as the Cartesian plane. You remember from high school that each pair (x,y) is a point on this flat x-y surface.

The plot of a line with a 45-degree angle, for example, is f(x)=y=x. We usually write y=f(x) to mean y is a function of x.

Let’s plot y = sin(x) for the familiar curve that ranges between 1 and -1. Here we go.

We first fill an array of enough data points to make a smooth chart. In particular we set them 0.1 apart and range from -5 to 5:

x = np.arange(-5,5,0.1)

The rest of the code is simple. bo means blue circle.

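import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(10,10))

# x values from -5 up to (but not including) 5, spaced 0.1 apart
x = np.arange(-5,5,0.1)
y = np.sin(x)

# 'bo' plots each point as a blue circle
plt.plot(x,y,'bo')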

Now, let’s add a third dimension to our first chart.

3D charting in Matplotlib

First, a caveat: People don’t use 3D charts often, mostly because readers have a difficult time understanding the charts. For example, it’s easy to read a 2D time-series chart, with time on the x-axis and y on the vertical axis. But if we add a third z-point, it’s floating in space, often resembling a blob and making the meaning hard to grasp.

Here, I will do something simple: place our flat sine curve in 3D space. The curve lies entirely within a single plane (technically, a hyperplane of the 3D space), since it has no extent in the third dimension.

In 3D space, each coordinate is given as (x,y,z). In calculus class, you’re used to seeing the coordinate system expressed like this:

The point z is a function of x and y. In charting, the principle is the same, except that the axes are drawn along the edges of a box enclosing the chart space rather than as arrows meeting at the origin. The tick marks (e.g., -2, -1, 0, 1, 2) are drawn on that box, so the origin, i.e. (x=0,y=0,z=0), is not where the drawn axes meet.

The direction of the z arrow can be up, out, or across. It’s just a matter of picking which orientation you find easiest to understand.

3D chart example

Here is a sample chart. We will draw the same sine curve as we drew on the flat Cartesian plane. But here we will hold the third coordinate at the constant value 0. Thus, all points lie on a single hyperplane.

This creates the illusion of 3D space, making our graph appear to float in space. That’s the whole point of making 3D charts: to add one more axis to the visual presentation. Of course, humans cannot visualize more than three spatial dimensions, so three axes is the practical limit.

Since this chart is a curve we can just use plot(), since we don’t have points jumping all over the place. (If we did, we would probably use scatter().)

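We first set our canvas to 3D, which gives us access to the axes object. We create the figure with an explicit figsize, since the default size might otherwise be too small. (Older Matplotlib versions used fig.gca(projection='3d') for this step, which newer releases no longer accept.)

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d')

Here we set the third coordinate, written as zs, to 0. zdir says which axis of the 3D box that constant applies to when plotting 2D data: with zdir='y', the curve sits at y=0 and the sine values run along the z-axis.

Then we can use the plot() method:

ax.plot(x,y,'bo', zs=0, zdir='y')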

The code

Here is the complete code. The extra lines label the origin, point (0,0,0), and the three axes so the chart is easier to read.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
import numpy as np

x = np.arange(-5,5,0.1)
y = np.sin(x)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d')
# place the 2D curve at y=0; the sine values run along the z-axis
ax.plot(x,y,'bo', zs=0, zdir='y')

# label the origin (0,0,0)
origin = [0,0,0]
ax.text(origin[0],origin[1],origin[2],"origin",size=20)

ax.set_xlabel('X',labelpad=10,fontsize='large')
ax.set_ylabel('Y',labelpad=10,fontsize='large')
ax.set_zlabel('Z',labelpad=10,fontsize='large')

plt.show()

Of course, data scientists are not plotting functions most of the time. They are plotting arrays. But using functions is the easiest way to illustrate this, as most programmers are familiar with those.

Plotly Python Tutorial https://www.bmc.com/blogs/plotly-python/ Thu, 21 Nov 2019 00:00:11 +0000

Plotly is a charting framework for Python and other programming languages. What makes Plotly different is that it renders charts with JavaScript, so they respond to mouse events. For example, you can make annotation boxes pop up when someone moves the cursor over the chart.

Plotly’s graph objects do not take a whole Pandas DataFrame at once. Instead, you pass them individual columns (or lists, arrays, or dictionaries built from the data), as the example in this article does. The Plotly Express module that ships with recent Plotly versions can accept a DataFrame directly.

A large collection of charts is available in this public repository and grouped into these categories:

Note: Plotly is free, and they offer paid versions including the Chart Studio platform where they say you can create charts without programming. That’s slightly misleading, as you would still need to write code to transform the data, but you can try that for yourself.

Plotly Python example with code and data

Here we give an example of how to draw the simplest of Plotly charts and what you need to get started with using it with Python. The code is available here and the data here.

First, you need to use Zeppelin or Jupyter notebook for a graphical environment in which you can both draw charts and display graphics.

Then, add Plotly to Python like this.

pip install plotly==4.2.1

The code, explained

Before we get into the code, a couple of notes:

  • In addition to drawing the chart inline, you can save it as an HTML file. Look at that file and notice that it has both a graphic image and JavaScript to make it interactive.
  • There might be some delay in generating the chart or it might not generate at all. We tested this code with Safari and Chrome browsers. If you search online, you’ll get conflicting instructions on how to display graphs in different graphical environments. If you’re still having problems generating a chart, get in touch with us at blogs@bmc.com, include your OS and browser details, and we’ll take a look.

Now for the complete code. The data we use is diet information over an 8-day period. The first part looks like this:

Next, we import a CSV file, then plot x and y, where x is the date and y is a chosen column:

go.Bar(x=daily.index.values,y=daily['Carbohydrates (g)'])

Date was originally a column but since we grouped and summed the data by date…

daily = df.groupby('Date').sum()

…it became the dataframe index, so we use daily.index.values to get the values.
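
The effect of the grouping step can be checked on a small stand-in table. In this sketch the rows and dates are made up for illustration; only the column names Date and Carbohydrates (g) come from the article's data.

```python
import pandas as pd

# Made-up rows standing in for the diet CSV.
df = pd.DataFrame({
    "Date": ["2019-11-01", "2019-11-01", "2019-11-02"],
    "Carbohydrates (g)": [30.0, 45.0, 50.0],
})

# Group by date and total each remaining column per day.
daily = df.groupby("Date").sum()

print(daily.index.name)                   # Date is now the index name
print(list(daily.index.values))           # one entry per distinct date
print(list(daily["Carbohydrates (g)"]))   # per-day totals
```

Because Date has moved into the index, daily['Date'] would fail here, which is why the chart code reads the dates from daily.index.values instead.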

This is a simple bar chart (go.Bar) with x and y values.

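import pandas as pd
import plotly.graph_objects as go
from plotly.offline import plot  # plot() can write the chart to an HTML file

# load the diet data and total each column per day
df = pd.read_csv("/home/ubuntu/Downloads/diet.csv")
daily = df.groupby('Date').sum()

# one bar per date, with height equal to that day's carbohydrates
fig = go.Figure(
    data=[go.Bar(x=daily.index.values,y=daily['Carbohydrates (g)'])],
    layout_title_text="A Figure Displayed with fig.show()"
)
fig.show()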

Finally, here is the resulting chart. Hover your cursor over a bar to see its value. That’s JavaScript doing the work for you.

Matplotlib Logarithmic Scale https://www.bmc.com/blogs/matplotlib-logarithmic-scale/ Thu, 19 Sep 2019 00:00:19 +0000

In this article, we’ll explain how to use the logarithmic scale in Matplotlib.

The logarithmic scale is useful for plotting data that includes very small numbers and very large numbers because the scale plots the data so you can see all the numbers easily, without the small numbers squeezed too closely.

Logarithms

First, let’s review a little high school math. A logarithm is a way to make a large number appear small by looking at it as a power of 10. There are other logarithm bases besides 10, like the natural logarithm used in mathematics, whose base is the constant e=2.718…. But, for our purposes, we will use base 10 logarithms.

In short:

log10(x) = y means 10 raised to the power y equals x, i.e., 10 ** y = x. So log10(100) = 2 because 10 ** 2 = 100.
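
The same relationship can be checked with Python's standard math module. This is just a small sketch; the variable names are arbitrary.

```python
import math

# log10 answers the question "10 to what power gives this number?"
y = math.log10(100)
print(y)               # 2.0, because 10 ** 2 == 100

# Going the other way: raising 10 to that power recovers the number.
x = 10 ** y
print(x)

# Each extra factor of 10 adds 1 to the logarithm.
big = math.log10(1_000_000)
print(big)
```

This is why a logarithmic axis gives equal spacing to 10, 100, 1,000, and so on.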

The logarithmic scale in Matplotlib

A two-dimensional chart in Matplotlib has a yscale and xscale. The scale means the graduations or tick marks along an axis. They can be any of:

  • matplotlib.scale.LinearScale: These are just evenly spaced numbers, like 1, 2, 3.
  • matplotlib.scale.LogScale: These are powers of 10 by default. You could use any base, like 2 or the natural logarithm base, which is the number e. A different base changes where the tick marks fall, not the relative positions of the plotted points.
  • matplotlib.scale.SymmetricalLogScale and matplotlib.scale.LogitScale: These cover cases a plain log scale cannot. The symmetrical log scale handles data that includes zero or negative values by staying linear near zero and logarithmic farther out. The logit scale is meant for values strictly between 0 and 1, such as probabilities.
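
In practice a scale is usually selected by its short name rather than by class. The sketch below, which assumes a non-interactive Agg backend and made-up data points, lists the names Matplotlib has registered and switches one axis to a logarithmic scale.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import matplotlib.scale as mscale

# Names Matplotlib accepts for a scale; the exact list depends on the version.
scale_names = mscale.get_scale_names()
print(scale_names)

fig, ax = plt.subplots()
ax.plot([1, 10, 100, 1000], [1, 2, 3, 4], "bo")  # made-up points

print(ax.get_xscale())   # axes start out linear
ax.set_xscale("log")     # switch the x-axis to a logarithmic scale
current = ax.get_xscale()
print(current)
plt.close(fig)
```

The plt.xscale("log") call used later in this article is the pyplot shortcut for the same set_xscale step on the current axes.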

Using the logarithmic scale

Let’s plot the revenue of some big companies and some small ones.

Amazon, Alphabet (Google), and Intel are many times larger than the small companies Pete’s, Clock, and Buckey’s BBQ (which we made up). The difference relative to Amazon is so enormous that, on the linear scale, the small companies all appear to lie on the same vertical line at 0.

Matplotlib picks the scale for the axes if you do not set it explicitly. Here we did not. The little 1e11 notation at the bottom means that the x-axis is in scientific notation, which in this case means revenue is shown as a multiple of 10**11 ($100 billion). Amazon’s revenue of $232,887,000,000 is about 2.329*(10**11), while Pete’s $600,000, as a multiple of 10**11, is 0.000006*(10**11). This explains why even Pepsi, a large company, is close to 0 on the chart below as well.
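
The arithmetic behind those axis positions can be reproduced in a few lines. This sketch only divides the revenue figures used in this article by 10**11; the dictionary and variable names are arbitrary.

```python
# Revenue figures from the article's data set, in dollars.
revenue = {
    "Amazon": 232_887_000_000,
    "Pepsi": 6_466_000_000,
    "Pete's": 600_000,
}

unit = 10 ** 11  # the multiplier shown on the linear axis

# Position of each company on an axis measured in units of 10**11.
scaled = {name: value / unit for name, value in revenue.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.6f} x 10**11")
```

On an axis that runs from 0 to about 2.3, values of roughly 0.06 and 0.000006 are both effectively drawn at zero.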

So, we fix that issue in the next graph by using the logarithmic scale. See below for an explanation of the code.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# revenue in dollars, keyed by company name
data = {
        "Pete's":           600000,
        "Clock":           1600000,
        "Buckey's BBQ":    2600000,
        "Pepsi":        6466000000,
        "Intel":       70848000000,
        "Alphabet":    136819000000,
        "Amazon":      232887000000
        }

df = pd.DataFrame.from_dict(data,orient='index',columns=['Revenue'])
dg = pd.to_numeric(df['Revenue'])
dc = pd.Series(dg.index.values.tolist()).to_frame('Company')
dat = df.assign(Company=dc.values)

data = dat.sort_values(by=['Revenue'])

plt.scatter(data['Revenue'],data['Company'])
plt.grid()
plt.show()

Here, everything is the same except we included plt.xscale(“log”). Now we can more easily see the values since the axis is marked in powers of 10. So, Pete’s small $600,000 revenue is easy to see as 6*(10**5). And Amazon’s is roughly 400,000 times as large, a gap of between 10**5 and 10**6.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = { 
        "Pete's":           600000,
        "Clock":           1600000,
        "Buckey's BBQ":    2600000,
        "Pepsi":        6466000000,
        "Intel" :      70848000000,	
        "Alphabet":    136819000000,
        "Amazon":      232887000000	
        }
        
df = pd.DataFrame.from_dict(data,orient='index',columns=['Revenue'])
dg =pd.to_numeric(df['Revenue'])
dc = pd.Series(dg.index.values.tolist()).to_frame('Company') 
dat = df.assign(Company=dc.values)

data = dat.sort_values(by=['Revenue'])
 
plt.scatter(data['Revenue'],data['Company'])
plt.grid()
plt.xscale("log")
plt.show()

Now you can more clearly see each company’s size by its revenue.
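To see where each company lands on a base-10 log axis, you can take the logarithm of its revenue yourself. A minimal sketch, using the smallest and largest revenues from the data above:

```python
import numpy as np

revenues = {"Pete's": 600_000, "Amazon": 232_887_000_000}

# log10 gives the position of each value on a base-10 log axis
exponents = {name: np.log10(value) for name, value in revenues.items()}

# how many times larger Amazon is than Pete's
ratio = revenues["Amazon"] / revenues["Pete's"]

print(exponents)
print(ratio)
```

Pete’s falls between 10**5 and 10**6 and Amazon between 10**11 and 10**12, so on the log axis they sit several tick marks apart instead of being squeezed together.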

Explaining the code

Some of this code may be difficult to understand without a familiarity with data science because it uses Pandas and NumPy. This is ironic, as Pandas was created particularly to make working with table-type data easier.

The first step takes the data we have created as a dictionary and converts it to a Pandas dataframe. The index for this data will be the company name. We said orient=’index’, which means the dictionary keys become the row index. Then we give the values a column name with columns=[‘Revenue’]. The company name has no column name because it’s not a column; it’s a row index.

df = pd.DataFrame.from_dict(data,orient='index',columns=['Revenue'])
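A shortened version of the same call, with only three of the companies, shows what ends up in the index and what ends up in the columns. The variable names index_labels and column_names are just for illustration.

```python
import pandas as pd

data = {"Pete's": 600000, "Clock": 1600000, "Amazon": 232887000000}

df = pd.DataFrame.from_dict(data, orient='index', columns=['Revenue'])

index_labels = list(df.index)     # the dictionary keys
column_names = list(df.columns)   # only the Revenue column

print(index_labels)
print(column_names)
```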

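Next, we make sure the revenue values are numbers. In this example they are already integers, so pd.to_numeric() changes nothing, but if the values had been read in as strings Matplotlib would treat them as category labels instead of placing them by size.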

dg =pd.to_numeric(df['Revenue'])
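The difference shows up as soon as the values really are text. The sketch below assumes a hypothetical case where the revenues were read in as strings, for example from a CSV file, and compares the two sort orders.

```python
import pandas as pd

# hypothetical case: revenue read in as text
as_text = pd.Series(["600000", "1600000", "70848000000"])

text_order = as_text.sort_values().tolist()                    # compared character by character
number_order = pd.to_numeric(as_text).sort_values().tolist()   # compared by value

print(text_order)
print(number_order)
```

As text, "1600000" sorts before "600000" because the character "1" comes before "6"; as numbers the order is by size.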

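Now we want to create a series, which is a one-dimensional labeled array, much like a single dataframe column. The values will be the index of the previous dataframe, and to_frame(‘Company’) turns the series into a one-column dataframe named Company. That’s how we get the companies listed as a column of data.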

dc = pd.Series(dg.index.values.tolist()).to_frame('Company')

Next, we add the values of the series we just created as another column of df and store the result in the dat dataframe.

dat = df.assign(Company=dc.values)
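As an aside, the same Company column can be produced in one step. This is an alternative sketch and not the approach used in this article: it names the index and then moves it into an ordinary column with rename_axis() and reset_index().

```python
import pandas as pd

data = {"Pete's": 600000, "Clock": 1600000, "Amazon": 232887000000}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Revenue'])

# name the index, then move it into an ordinary column
dat = df.rename_axis('Company').reset_index()

print(dat)
print(dat['Revenue'].shape)
```

Note that reset_index() replaces the company-name index with a plain 0, 1, 2 index, whereas the assign() approach keeps the company names as the index as well.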

Then we plot the scatter chart, giving it one dataframe column (a Pandas series) for the x values and one for the y values. Each has the same shape, (7,), which you can check with data[‘Revenue’].shape. Shape is a sometimes difficult NumPy concept. It basically means the size of the array along each dimension, here 7 elements in one dimension. In an xy plot the two shapes must be the same.

plt.scatter(data['Revenue'],data['Company'])

The rest is self-explanatory:

plt.grid()
plt.xscale("log")
plt.show()
Matplotlib Scatter and Line Plots Explained https://www.bmc.com/blogs/matplotlib-scatter-line-plots/ Thu, 12 Sep 2019 16:10:41 +0000

In this article, we’ll explain how to get started with Matplotlib scatter and line plots.

(This article is part of our Data Visualization Guide. Use the right-hand menu to navigate.)

Install Zeppelin

First, download and install Zeppelin, a browser-based notebook with a Python interpreter, which we’ve previously discussed. A plain Python shell in a text terminal cannot display charts inline, so a graphical environment like this is the simplest way to see them.

Start Zeppelin. If you are using a virtual Python environment, source that environment first (e.g., source py34/bin/activate), just as you would when running Python as a regular user. That way Zeppelin can import NumPy and Matplotlib, which you need to install using pip.

First plot

Here is the simplest plot: x against y. The two arrays must be the same size since the numbers plotted are picked off the arrays in pairs: (1,1), (2,2), (3,3), (4,4).

We use plot(), but we could also have used scatter(). They are almost the same, because plot() can either draw a line or make a scatter plot. The differences are explained below.

import numpy as np
import matplotlib.pyplot as plt

x = [1,2,3,4]
y = [1,2,3,4]
plt.plot(x,y)
plt.show()

Results in:

You can feed any number of arguments into the plot() function. The format is plt.plot(x,y,colorOptions, *args, **kwargs). *args and **kwargs let you pass values to other objects, which we illustrate below.

If you only give plot() one value, it assumes that is the y coordinate. If you put dashes (“--”) after the color name, then it draws a line between each point, i.e., makes a line chart, rather than plotting points, i.e., a scatter plot. Leave off the dashes and follow the color with a point marker, which can be a triangle (“v”), circle (“o”), etc., and it plots the points only.
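The format-string behavior described above can be checked by inspecting the line objects that plot() returns. This is a small sketch with made-up data; the names implicit_x, dashed, and points are only for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# only one list: treated as y values, x defaults to 0, 1, 2, ...
(implicit_x,) = ax.plot([5, 6, 7])

# color followed by dashes: a green dashed line
(dashed,) = ax.plot([1, 2, 3], [1, 4, 9], 'g--')

# color followed by a marker: green triangles with no connecting line
(points,) = ax.plot([1, 2, 3], [1, 8, 27], 'gv')

print(list(implicit_x.get_xdata()))
print(dashed.get_color(), dashed.get_linestyle())
print(points.get_marker(), points.get_linestyle())
```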

Here we use np.array() to create a NumPy array. Even if you pass plain lists, Matplotlib converts them to NumPy arrays internally. NumPy is your best option for data science work because of its rich set of features.

Use NumPy Arrays

Here we pass it two sets of x,y pairs, each with its own format string: ‘g--’ draws a green dashed line and ‘o--’ draws a dashed line with circle markers.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,2,3,4])

plt.plot(x,x**2,'g--', x, x**3, 'o--')

We could have plotted the same two line plots above by calling the plot() function twice, illustrating that we can paint any number of charts onto the canvas.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,2,3,4])

plt.plot(x,x**2,'g--')
plt.plot(x, x**3, 'o--')

You can plot data from a labeled container, such as a dictionary or a Pandas dataframe, by naming its elements, as shown below. Here we are saying plot data[‘a’] versus data[‘b’].

data = {'a': np.arange(10),
    'b': np.arange(10)}
 

plt.scatter('a', 'b', c='g', data=data)

print(data)

plt.show()

This is the same as below, albeit we use Pandas.

import pandas as pd

data = {'a': np.arange(10),
    'b': np.arange(10)}
    
df=pd.DataFrame(data=data)

plt.scatter('a', 'b', c='g', data=df)
 

plt.show()

In the first example above, print(data) shows that data is a dictionary object with the keys a and b:

{'b': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'a': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}

We can pass the size of each point in as an array, too:

import pandas as pd

data = {'a': np.arange(10),
    'b': np.arange(10),
     'c':  np.arange(10) * 100
}
    
df=pd.DataFrame(data=data)

plt.scatter('a', 'b', c='g', s='c', data=df)
 

plt.show()

You could add the coordinates to this chart by using text annotations.

The arguments are matplotlib.pyplot.annotate(s, xy, *args, **kwargs).

Where:

  • s is the string to print (recent Matplotlib releases name this parameter text)
  • xy is the coordinates given in (x,y) format. Add 0.25 to x so that the text is offset from the actual point slightly.
  • **kwargs means we can pass additional arguments through to the Text object, which has properties such as fontsize and fontweight.
import pandas as pd

data = {'a': np.arange(10),
    'b': np.arange(10),
     'c':  np.arange(10) * 100
}
    
df=pd.DataFrame(data=data)

plt.scatter('a', 'b', c='g', s='c', data=df)

for row in df.itertuples():
    x = row.a
    y = row.b 
    label = "({0},{1})".format(x,y)  # named label so the built-in str is not shadowed
    plt.annotate(label, (x + 0.25 ,y), fontsize='large', fontweight='bold')
    
 

plt.show()

Results in:

How to Add Subplots in Matplotlib https://www.bmc.com/blogs/matplotlib-subplots/ Thu, 05 Sep 2019 00:00:42 +0000

Start by plotting one chart onto the chart surface.

  • Use plt.axes(), with no arguments. Matplotlib will then autofit the chart to our data.
  • The function np.arange(0,25,0.1) creates 250 numbers ranging from 0 to 24.9 in increments of 0.1. The end value, 25, is not included.
  • The y axis will range between 1 and -1 since the sin function np.sin(x) ranges between 1 and -1.
  • Annotate the chart by labelling each axis with plt.ylabel(‘sin(x)’) and plt.xlabel(‘x’).
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 25,0.1)

axis1 = plt.axes()
plt.ylabel('sin(x)')
plt.xlabel('x')
# pass x explicitly so the horizontal axis shows x rather than the sample number
axis1.plot(x, np.sin(x))

That results in this chart:

(This article is part of our Data Visualization Guide. Use the right-hand menu to navigate.)

Vertically stacked figures

Now, plot two charts, one stacked on top of the other. Use plt.subplots(2). Note that we plot sin(x) in the top chart and cos(x) in the bottom to avoid graphing the same data twice.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 25,0.1)
fig, axis = plt.subplots(2)


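# label each chart through its own axes object; plt.ylabel() would only
# label whichever chart is current, which is the bottom one here
axis[0].set_ylabel('sin(x)')
axis[1].set_ylabel('cos(x)')
axis[1].set_xlabel('x')
axis[0].plot(x, np.sin(x))
axis[1].plot(x, np.cos(x))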

That results in:

Horizontal Side by Side Figures

Here we add:

  • fig, axis = plt.subplots(1,2,figsize=(15,5)) meaning 1 row and 2 columns. Add figsize meaning width and height, respectively.
  • Note: figsize is given in inches, as the Matplotlib documentation says. Matplotlib converts inches to pixels using the figure’s dpi (dots per inch) setting, so at 100 dpi a 15×5 figure is drawn 1500×500 pixels. How large that looks on screen depends on the pixel density of your monitor and on any scaling the notebook applies, which is why the chart rarely measures exactly 15 inches across.
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 25,0.1)
fig, axis = plt.subplots(1,2,figsize=(15,5))

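# label each chart through its own axes object
axis[0].set_ylabel('sin(x)')
axis[1].set_ylabel('cos(x)')
axis[0].set_xlabel('x')
axis[1].set_xlabel('x')
axis[0].plot(x, np.sin(x))
axis[1].plot(x, np.cos(x))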

This results in the two charts placed side-by-side but spread farther apart.
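To see how figsize relates to what is drawn, you can ask the figure for its size in inches and multiply by its dpi. In this sketch dpi=100 is passed explicitly, which is an assumption made so the arithmetic is easy to follow; your environment may use a different default.

```python
import matplotlib.pyplot as plt

# dpi=100 is set explicitly for this illustration
fig, axis = plt.subplots(1, 2, figsize=(15, 5), dpi=100)

width_in, height_in = fig.get_size_inches()
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(axis.shape)              # one row of two axes comes back as a 1-D array
print(width_in, height_in)     # size in inches
print(width_px, height_px)     # size in pixels at this dpi
```

The figure is 15 by 5 inches on paper terms, and the pixel size follows from the dpi. The physical size on a screen then depends on how many pixels per inch that screen has.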

Using Matplotlib to Draw Charts and Graphs https://www.bmc.com/blogs/matplot-charts-graphs/ Fri, 30 Aug 2019 00:00:52 +0000

If you are working with big data then you need to learn how to create charts, which are also known as graphs. Charting is a topic that can be extremely complicated, so we’ll start here with simple examples.

(This article is part of our Data Visualization Guide. Use the right-hand menu to navigate.)

Charts Require a Graphical Environment

You cannot draw charts from a Python program running in a character-based environment like a bash shell. Instead, you need a graphical environment.

Most data scientists use Matplotlib in a browser, since a browser can display graphics. The best way is using Zeppelin or Jupyter, as both are code interpreters and tools that can display graphics.

Matplotlib was originally written to give Python a plotting interface modeled on the one in MATLAB. MATLAB is mainly used by engineers and scientists, and its style of plotting commands carries over well to Python.

Using Zeppelin

We’ve discussed Zeppelin here before. Zeppelin lets you run programs in a variety of programming languages in a web page. It supports Spark, Python, Angular, MarkDown, Livy, a Bash shell and others. (Because it supports bash shells you would not want to put it on a public-facing web page without adding a password to it.) Zeppelin also has some built-in graphical ability, but in order to create more advanced charts, you’ll need an advanced charting product, like Matplotlib.

Here we show how to use Matplotlib to draw line and scatter charts and histograms. First, you need to install Zeppelin, which is as easy as downloading and unzipping it. On Mac, start the daemon with zeppelin-daemon.sh start.

Line Charts

A line chart plots one axis against another, such as the familiar xy axes used in high school algebra.

Matplotlib works with arrays. In the example below, we use the NumPy function np.arange(1,5,0.25) to create an array of values running from 1 up to, but not including, 5 in steps of 0.25. Then it’s as simple as calling plot: plt.plot(y, ‘bs’), where bs means blue squares. If you give it only one array, then it assumes that those are values for the y axis, so it automatically calculates the x axis.

%python
import matplotlib.pyplot as plt
import numpy as np
 
y = np.arange(1,5,0.25)
 
plt.plot(y, 'bs')

Here is the resulting chart.

Now let’s add annotations to the line. Remember that Matplotlib calculates the x values automatically based upon the y values. Estimated by eye, the slope (m) of the line above, y = mx + 1, is approximately 0.29; the exact value is 0.25, because y rises by 0.25 for each step of 1 in x. So let’s plot those (x,y) coordinates and label each point.
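As a check on the visual estimate, the slope and intercept can be computed directly. This sketch rebuilds the x values that Matplotlib assumes (0, 1, 2, ...) and fits a straight line with np.polyfit(); the names m and b follow the y = mx + b convention.

```python
import numpy as np

y = np.arange(1, 5, 0.25)
x = np.arange(len(y))      # the x values Matplotlib assumes: 0, 1, 2, ...

# fit a straight line y = m*x + b through the points
m, b = np.polyfit(x, y, 1)

print(m, b)
```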

Let’s make the increment 1 and not 0.25 in order to avoid crowding the chart.

In the code below, the Python zip(A,B) function returns tuples ((A[0],B[0]), …) from the lists A and B. We do this because we need each x,y coordinate in the form of the tuple (x,y). We use the lists A and B to contain the x and y coordinates.

%python
import matplotlib.pyplot as plt
import numpy as np
 
y = np.arange(0,5,1)

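A = []  # x coordinates
B = []  # y coordinates

# use a separate name for each y value so the array y is not overwritten
for x in range(1, len(y)):
    y_val = (0.29*x)
    A.append(x)
    B.append(y_val)

# zip pairs the two lists into (x, y) tuples
for xy in zip(A,B):
    plt.plot( xy[0],xy[1],'gs')
    plt.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')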

Here is the same line, with each (x,y) coordinate printed as an annotation on the line.
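The pairing that zip() performs can be seen in isolation. The values in B below are made up for illustration.

```python
A = [1, 2, 3, 4]
B = [0.5, 1.0, 1.5, 2.0]   # made-up y values

pairs = list(zip(A, B))
print(pairs)

# each pair can be dropped straight into a format string, as in the annotation
first_label = '(%s, %s)' % pairs[0]
print(first_label)
```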

Scatter Charts

Scatter charts plot points and not lines.

%python
import matplotlib.pyplot as plt
import numpy as np

n = 10
x = np.random.rand(n)
y = np.random.rand(n)
 
plt.scatter(x,y)

Because we used random numbers, the points are all over the place. We used the rand() function and not randint(), so it generated random floating point numbers that are at least 0 and less than 1.
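A quick side-by-side of the two functions, as a sketch; the range passed to randint() here is arbitrary.

```python
import numpy as np

floats = np.random.rand(10)                  # floats in the interval [0, 1)
ints = np.random.randint(0, 10, size=10)     # integers from 0 to 9

print(floats.dtype, floats.min(), floats.max())
print(ints.dtype, ints.min(), ints.max())
```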

Histogram

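In a histogram the height of each vertical bar is the number of points that fall into that bin, aka a frequency distribution. By default plt.hist() groups the values into 10 bins and plots counts rather than percentages; because this array has exactly 100 points, each count can also be read as a percentage. So, for example, a bar of height 17 over the value 3 would mean that about 17% of the random numbers fell into that bin. The number 100 is called the size of the array. Here we have a one-dimensional array of 100 elements, equivalently called a vector.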

If we had used, for example, x = np.random.randint(low = 0, high = 15, size=[4,4]) it would create a 4×4 matrix of random numbers.

%python
import matplotlib.pyplot as plt
import numpy as np
 

x = np.random.randint(low = 0, high = 15, size=100) 



plt.figure()
plt.hist(x)

plt.show()

This shows the frequency distribution. In other words, we told it to create 100 random integers from 0 up to, but not including, 15.
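The numbers behind the bars are returned by hist() itself, so you can inspect them. The fixed random seed in this sketch is an assumption added so that reruns give the same values; it is not part of the example above.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)   # fixed seed, added for this sketch only
x = np.random.randint(low=0, high=15, size=100)

fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(x)

print(counts)        # number of points in each bin
print(bin_edges)     # 11 edges marking the 10 default bins
```

The counts always add up to the size of the array, here 100.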

We previously explained how to create a Stacked Bar Chart here.

Pie Chart

Below we create a pie chart. The percentages of the slices sum to 100%, just like a histogram. So each slice xi (called a wedge by Matplotlib) is xi / sum(x1 … xn) of the whole pie, expressed as a percentage.

Below we give it the labels. Those have to be in the same order as the data, as Matplotlib cannot automatically figure that out. In other words, get them out of order and your labels will be attached to the wrong wedges.

We also provide autopct, to plot the percentage in each wedge. We could also have put a custom function there instead.

%python
import matplotlib.pyplot as plt
import numpy as np

 
x = [100, 200, 300, 400, 500]
labels = ['first', 'second', 'third', 'fourth', 'fifth']

plt.pie(x,labels=labels,autopct='%1.1f%%')

plt.show()
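The sketch below works out the wedge percentages by hand and shows one way a custom function could be passed as autopct. The function name show_percent and its output format are made up for this illustration.

```python
import matplotlib.pyplot as plt

x = [100, 200, 300, 400, 500]
labels = ['first', 'second', 'third', 'fourth', 'fifth']

# the share of each wedge, worked out by hand
total = sum(x)
percentages = [100 * value / total for value in x]
print(percentages)

def show_percent(pct):
    # Matplotlib calls this once per wedge with that wedge's percentage
    return '{:.0f} percent'.format(float(pct))

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(x, labels=labels, autopct=show_percent)
```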

Neo4j Graph Database Queries https://www.bmc.com/blogs/neo4j-graph-database-queries/ Fri, 29 Mar 2019 00:00:20 +0000

In the previous blog post, we introduced Neo4j. Here we explain queries.

First, start the server then open the shell:

cd neo4j/bin

./neo4j start

./cypher-shell   -a bolt://localhost:7687  -u neo4j  -p xxx

(This article is part of our Data Visualization Guide. Use the right-hand menu to navigate.)

Create Node and Relations

Here we create two nodes. They each have the label Friends and a name property. Relations can have properties too.

CREATE (x:Friends { name: 'Walker' }) 
CREATE (y:Friends { name: 'Stephen' }) 
return x,y;

+------------------------------------------------------------+
| x                           | y                            |
+------------------------------------------------------------+
| (:Friends {name: "Walker"}) | (:Friends {name: "Stephen"}) |
+------------------------------------------------------------+

1 row available after 15 ms, consumed after another 0 ms
Added 2 nodes, Set 2 properties, Added 2 labels

Note: Spelling and case are important. Neo4j gives you no warning if anything is spelled wrong. So pay attention when creating, for example, relationships, as a misspelled name simply matches nothing and no relationship is created, with no warning given. Look for messages like Created 1 relationships and 1 row available after each statement to make sure it worked.

We then create a relationship (called an edge) from Walker to Stephen. We give this relation the arbitrary name Friend. We put the letters x, y, and r in front of objects so that we can refer to them in subsequent steps. This is shorthand notation, and it brings the created object into scope.

The procedure to create a relation is to bring the nodes together with a MATCH and WHERE statement, then issue the CREATE with a directional arrow to indicate the direction of the relation. Neo4j stores every relationship with a direction, so CREATE always needs an arrow. Some relations logically run both ways: if Stephen is a Friend of Walker then Walker is a Friend of Stephen. In that case you create the relationship in one direction and ignore the direction when you query.

In queries (i.e., MATCH statements) you use --> or <-- when direction matters and a plain -- when it does not. The plain -- form cannot be used in the creation statement.

MATCH (x:Friends),(y:Friends)
WHERE y.name = "Stephen" AND x.name = "Walker"
CREATE (x)-[r:Friend]->(y)
RETURN type(r);

+----------+
| type(r)  |
+----------+
| "Friend" |
+----------+

List all Friend Nodes

Now, let’s try different queries. This one lists all Friends nodes, since we did not give it any WHERE condition.

match(n:Friends)
       return n;
+------------------------------+
| n                            |
+------------------------------+
| (:Friends {name: "Walker"})  |
| (:Friends {name: "Stephen"}) |
+------------------------------+

List all Friend relations.

MATCH (a:Friends)-[:Friend]->(b:Friends)
       RETURN a.name, b.name;
+----------------------+
| a.name   | b.name    |
+----------------------+
| "Walker" | "Stephen" |
+----------------------+

Make Raj a friend of Stephen.

CREATE (y:Friends { name: 'Raj' }); 

MATCH (x:Friends),(y:Friends)
WHERE y.name = "Stephen" AND x.name = "Raj"
CREATE (x)-[r:Friend]->(y)
RETURN type(r);

Make sure the command (mainly the MATCH statement) worked by paying attention to the numbers returned. Remember what we said about spelling errors and case.

1 row available after 3 ms, consumed after another 0 ms
Created 1 relationships

Show all Friend relationships. Note that the list now includes the one we just added.

MATCH (a:Friends)-[r:Friend]->(b:Friends)
       RETURN a.name, b.name;
+----------------------+
| a.name   | b.name    |
+----------------------+
| "Walker" | "Stephen" |
| "Raj"    | "Stephen" |
+----------------------+

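Add one more, Teresa, then make her a Friend of Raj. Note that the second statement below repeats the earlier CREATE from Raj to Stephen, so that relationship now exists twice, which is why it appears as a duplicate row in the results that follow.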

CREATE (x:Friends { name: 'Teresa' }) 
return x;

MATCH (x:Friends),(y:Friends)
WHERE y.name = "Stephen" AND x.name = "Raj"
CREATE (x)-[r:Friend]->(y);

MATCH (x:Friends),(y:Friends)
WHERE y.name = "Raj" AND x.name = "Teresa"
CREATE (x)-[r:Friend]->(y)
RETURN type(r);

Below we show friends of friends who are not friends of each other. That would be Teresa and Stephen: Teresa is a friend of Raj, and Raj is a friend of Stephen, but Teresa and Stephen are not directly connected.

To put this in terms of sets, which might make it simpler to understand: we first collect every node that u can reach in two Friend steps, calling it notFriend, and then apply WHERE NOT to drop any that u is already directly connected to.

To make this easier to see, first recall which friends we have:

Match (a:Friends)--(b)
       return a,b;
+-------------------------------------------------------------+
| a                            | b                            |
+-------------------------------------------------------------+
| (:Friends {name: "Walker"})  | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})     |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})     |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Walker"})  |
| (:Friends {name: "Raj"})     | (:Friends {name: "Teresa"})  |
| (:Friends {name: "Raj"})     | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Raj"})     | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Teresa"})  | (:Friends {name: "Raj"})     |
+-------------------------------------------------------------+

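The friends-who-are-not-friends-of query lists Teresa and Stephen because Teresa is not a friend of Stephen. In other words, the first part of the query follows two Friend relations out from each person u, and the WHERE NOT then keeps only the people at the far end who have no direct Friend relation from u. The row appears twice because the Raj to Stephen relationship was created twice, which gives two paths from Teresa to Stephen.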

Match (u:Friends)-[:Friend]->(:Friends)-[:Friend]->(notFriend:Friends)
       WHERE NOT (u)-[:Friend]->(notFriend)
       RETURN u, notFriend;
+------------------------------------------------------------+
| u                           | notFriend                    |
+------------------------------------------------------------+
| (:Friends {name: "Teresa"}) | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Teresa"}) | (:Friends {name: "Stephen"}) |
+------------------------------------------------------------+
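The logic of that query can be mirrored in plain Python, which may make the two-hop pattern easier to follow. This is only a model of the idea and not how Neo4j evaluates it; the edge list is copied from the relationships created above, with Raj to Stephen listed twice because that CREATE was run twice.

```python
# directed Friend relationships as (from, to) pairs
edges = [("Walker", "Stephen"), ("Raj", "Stephen"),
         ("Raj", "Stephen"), ("Teresa", "Raj")]

rows = []
for u, middle in edges:                 # (u)-[:Friend]->(middle)
    for start, not_friend in edges:     # (middle)-[:Friend]->(notFriend)
        if start != middle:
            continue
        if (u, not_friend) not in edges:   # WHERE NOT (u)-[:Friend]->(notFriend)
            rows.append((u, not_friend))

print(rows)
```

The result is the Teresa and Stephen pair, twice, matching the two rows in the table above.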

Directed Relations

Now we create a relationship in the opposite direction, Friend A <- Friend B instead of Friend A -> Friend B, by putting the directional arrow in the first position in the CREATE statement.

MATCH (x:Friends),(y:Friends)
WHERE y.name = "Raj" AND x.name = "Teresa"
CREATE (x)<-[r:Friend]-(y)
RETURN type(r);

Now list all relations, first leftward, then non-directional, then rightward. An outbound relationship from one Friends node to another is the same as an inbound relationship seen from the other node, since both nodes carry the same label. So the first and last queries list the same 4 relations. The non-directional query lists 8, since each relation matches once in each direction, which is what the -- means.

It would be simpler to see all of this if we had called one set of nodes Customers and another AccountManager, since then there would be no logical symmetry. We will build up examples like that in subsequent blog posts.

Match (a:Friends)<--(b)
       return a,b;
+------------------------------------------------------------+
| a                            | b                           |
+------------------------------------------------------------+
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})    |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})    |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Walker"}) |
| (:Friends {name: "Raj"})     | (:Friends {name: "Teresa"}) |
+------------------------------------------------------------+

Match (a:Friends)--(b)
       return a,b;
+-------------------------------------------------------------+
| a                            | b                            |
+-------------------------------------------------------------+
| (:Friends {name: "Walker"})  | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})     |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Raj"})     |
| (:Friends {name: "Stephen"}) | (:Friends {name: "Walker"})  |
| (:Friends {name: "Raj"})     | (:Friends {name: "Teresa"})  |
| (:Friends {name: "Raj"})     | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Raj"})     | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Teresa"})  | (:Friends {name: "Raj"})     |
+-------------------------------------------------------------+

8 rows available after 22 ms, consumed after another 2 ms

Match (a:Friends)-->(b)
       return a,b;
+------------------------------------------------------------+
| a                           | b                            |
+------------------------------------------------------------+
| (:Friends {name: "Walker"}) | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Raj"})    | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Raj"})    | (:Friends {name: "Stephen"}) |
| (:Friends {name: "Teresa"}) | (:Friends {name: "Raj"})     |
+------------------------------------------------------------+

4 rows available after 24 ms, consumed after another 2 ms
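The row counts in the three listings can be reproduced with a small Python model of the four relationships shown. This is a sketch of the counting logic only and does not talk to Neo4j.

```python
# the four directed Friend relationships shown in the listings above
edges = [("Walker", "Stephen"), ("Raj", "Stephen"),
         ("Raj", "Stephen"), ("Teresa", "Raj")]

outbound = [(a, b) for a, b in edges]   # (a)-->(b)
inbound = [(b, a) for a, b in edges]    # (a)<--(b): same edges, seen from the other end
either = outbound + inbound             # (a)--(b): each edge matches once per direction

print(len(outbound), len(inbound), len(either))
```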