Walker Rowe – BMC Software | Blogs

Power BI Visualization Types
https://s7280.pcdn.co/power-bi-visualization-types/
Wed, 10 Mar 2021 14:29:15 +0000

This article sums up some of the most common visualizations in Microsoft Power BI Desktop. (Remember that you create reports in BI Desktop and then publish them to powerbi.com.)

(This article is part of our Power BI Guide. Use the right-hand menu to navigate.)

Power BI standard visualizations

Power BI has the normal set of visuals in the default product. The palette of standard visualizations looks like this:

But the product is made better by third-party visualizations, where anyone can contribute a new or variant visualization to a central repository.

Power BI visuals

Here are some of the other Power BI Visuals:

Pie chart

The icons from the visualization palette are too small to copy here. Fortunately, the pie chart looks like a pie. Hover the mouse over the palette and the name will pop up.

Like the histogram, the pie chart divides the data set into slices that together sum to 100%.

This example is expenses by category. The percentage and amount would make the chart too crowded if they were all added there. So that is done using tooltips, which are popups to show additional information.

Funnel

Here is the funnel. It also sums to 100%. This shows spending by category in descending order.

Stacked bar chart

A bar chart gives two dimensions, meaning one metric on the x axis and one on the y axis. To add another metric, you can use the actual bar itself to convey information.

In this stacked bar chart, we not only have spending by category, we have it by account. The vertical lines are scaled so that the first column represents one amount, while the second shows a different amount. That lets each line be the same height.

In the funnel and pie chart we had to drop some categories because they were much larger than the others, which squashed the smaller categories until they were too small to read.

Getting Authentication Access Tokens for Microsoft APIs
https://www.bmc.com/blogs/microsoft-apis-authentication-tokens/
Thu, 04 Mar 2021 13:58:59 +0000

To use Microsoft Power BI or other Microsoft APIs, you have to obtain an access token, also known as a bearer token. This is because Microsoft uses OAuth2, an industry standard protocol, to control access. (In other words, a simple API key or username with a password is not enough.)

In this tutorial, we explain how to do that.

Note: We use curl to post data to Microsoft endpoints. That’s like the command line version of Postman. On Mac and Ubuntu, curl is already there. You might have to install it on Windows.

Registering Power BI

If you’re doing all this for the very first time, there’s a Step 0 before you can perform the two steps of OAuth2 authentication.

You first have to register your application as a means of getting credentials. You do that one time. This generates an application ID and secret key. For Microsoft Power BI, you do it like this:

First, log into the embedding tool at https://app.powerbi.com/embedsetup/UserOwnsData

This is not the same as logging into Azure and creating an application in Active Directory there. You are creating an application on Power BI’s Azure account (if you want to think of it that way).

Next, fill out the screens below. Note that:

  • For the URL, you can use any web page. You will look at the parameters passed to this web page as we show below.
  • Skip the screen that says import content.
  • For API access, click select all.
  • At the end, copy and save the Application ID and Application Secret.

Using OAuth2 for REST APIs

Once you’ve registered, you can move to this step.

Basic authentication is when you need only a user ID and password for access to something.

But Microsoft uses OAuth2 authentication. Microsoft APIs require that you present an Authorization header in order to use the API. Basically, OAuth2 is a two-step process:

  1. Do a POST to login.microsoftonline.com
  2. Take the access/bearer token from Step 1 and pass that to the API in a header called Authorization for whatever API you are calling.

Getting a token (code)

To get the authorization code, click on this URL to open a browser:

https://login.microsoftonline.com/common/oauth2/authorize?client_id=(appid)&response_type=code&response_mode=query&redirect_uri=(url you put when you registered app)&scope=openid&state=foo

Basically, it will take you to the URL you put when you registered the application. But first a screen will pop up asking you to grant certain permissions. The query parameters in the URL above are:

  • response_type: code
  • response_mode: query
  • state: foo (Any value will work here. It’s just a place for free-form data.)
  • scope: openid (You could also add offline_access.)
  • url: We use the same base URL throughout but change the last path segment, first to authorize and later to token, to call different Microsoft endpoints.

Note: Here, the tenant in the URL is common rather than a specific tenant ID. Common tells Microsoft to work out the tenant from the account you sign in with.
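If you would rather build the authorize URL in code than edit it by hand, a minimal Python sketch could look like the following. The application ID and redirect URL are made-up placeholders, and build_authorize_url is a hypothetical helper name, not part of any Microsoft library. The parameter names are the ones listed above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder values: substitute your own application ID and redirect URL.
APP_ID = "00000000-0000-0000-0000-000000000000"
REDIRECT_URI = "https://example.com/"

def build_authorize_url(client_id, redirect_uri, state="foo"):
    """Return the authorize URL carrying the query parameters from the list above."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "response_mode": "query",
        "redirect_uri": redirect_uri,
        "scope": "openid",
        "state": state,
    }
    # urlencode percent-encodes the redirect URL so it survives as a query value.
    return "https://login.microsoftonline.com/common/oauth2/authorize?" + urlencode(params)

authorize_url = build_authorize_url(APP_ID, REDIRECT_URI)
print(authorize_url)
```

Paste the printed URL into a browser to start the sign-in, exactly as with the hand-written URL.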

Now, you certainly could have written some kind of web listener to retrieve the code that Microsoft created. But we will just use the debugger in a Chrome browser to see the query parameter that Microsoft passed to our web page.

When Microsoft redirects you to the web page you indicated, go to the network tab in the browser and click the refresh button on the browser.

Then click on the code field and press Copy as cURL. The code (token) appears as the query parameter code as shown below.

https://walkercodetutorials.com/?code=0.ASsARY...
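Instead of copying the code out of the browser's network tab, you could paste the redirected URL into a short script. This is only a sketch: extract_code is a hypothetical helper and the URL below uses a made-up code value.

```python
from urllib.parse import urlsplit, parse_qs

def extract_code(redirected_url):
    """Return the value of the code query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(redirected_url).query)
    values = query.get("code")
    return values[0] if values else None

# Hypothetical redirect URL with a made-up, shortened code value.
example = "https://example.com/?code=0.EXAMPLE_CODE&state=foo"
print(extract_code(example))
```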

If you are wondering at this point why the URL is not some URL in Power BI, that’s because you registered the application in Power BI. So, Microsoft knows that Power BI is what you want to use. The redirect URL serves merely as a place to retrieve this code.

Going forward, you would not want to click on the browser every time—this is not how a batch program would work. So look at the prompt setting in the Microsoft Identity Platform reference guide to see how to change that.

Getting an access token

We use curl to illustrate the next steps. Get the access token (bearer token) this way.

The values are:

  • grant_type: Put “authorization_code”
  • client_id: Application ID from above (The dots in the command below hide my actual ID.)
  • client_secret: Application Secret from above
  • redirect_uri: Same as above
  • scope: Same as above
  • url: Note that the endpoint has changed to token

curl -X POST \
  --form 'grant_type=authorization_code' \
  --form 'client_id=7...5' \
  --form 'client_secret=21dVzEgtjUhfyZS3AJDaH0eMYB0q0ovYeH4YUoa//FM' \
  --form 'scope=openid%20offline_access' \
  --form 'response_type=code' \
  --form 'redirect_uri=https://walkercodetutorials.com/' \
  --form 'code=0.AS...AA' \
  https://login.microsoftonline.com/common/oauth2/token

Returns:

{"token_type":"Bearer","expires_in":"3599","ext_expires_in":"3599","expires_on":"1614591204","not_before":"1614587304","resource":"https://analysis.windows.net/powerbi/api","access_token":"ey….G8CYZQT6t2p5IC1r3E7D_koNqc6h_-f3918o_BP2N0YOweCKKZ7WCw"}
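A script that consumes this response needs to pull out the token and know when it expires. The sketch below parses a response shaped like the one above. The token value is a made-up placeholder and only a few of the fields are included.

```python
import json
from datetime import datetime, timezone

# A response shaped like the one above, with a made-up placeholder token.
raw = (
    '{"token_type":"Bearer","expires_in":"3599",'
    '"expires_on":"1614591204","access_token":"PLACEHOLDER"}'
)

token = json.loads(raw)

# The numeric fields arrive as strings, so convert before using them.
expires_at = datetime.fromtimestamp(int(token["expires_on"]), tz=timezone.utc)

# This is the value that goes into the Authorization header.
auth_header = f'{token["token_type"]} {token["access_token"]}'

print(expires_at.isoformat())
print(auth_header)
```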

Testing your Microsoft API access

Take the access_token value from the previous step and add it as an Authorization header value as shown below. (You have one hour before it expires.)

This, for example, is how you return a list of datasets in Power BI in My workspace. (That’s the default workspace for free Power BI accounts, meaning for one individual’s use only, as opposed to, for example, an enterprise account.)

Note: myorg does not mean your org. It’s just a placeholder required by Microsoft.

curl -X GET -H "Authorization: Bearer ey….W_A" -H "Content-Type: application/json" https://api.powerbi.com/v1.0/myorg/datasets
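The same request can be prepared in Python with the standard library. This sketch only builds the request and does not send it. build_datasets_request is a hypothetical helper name and the token is a placeholder.

```python
import urllib.request

def build_datasets_request(access_token):
    """Build (but do not send) the GET request for the datasets endpoint."""
    return urllib.request.Request(
        "https://api.powerbi.com/v1.0/myorg/datasets",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )

request = build_datasets_request("PLACEHOLDER_TOKEN")  # made-up token
print(request.get_method(), request.full_url)
# With a real token, urllib.request.urlopen(request) would send it.
```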

That concludes this tutorial.

Creating & Using Linked Tables in Power BI
https://www.bmc.com/blogs/power-bi-creating-linked-tables/
Wed, 24 Feb 2021 15:32:00 +0000

One good thing about Power BI is that when you add two tables to a dashboard they are synchronized.

So, when you click on one table, the linked table filters on that selected value. (It uses relationships between tables to do that, which we’ve previously explained.)

Let’s take a look at how this works.

How to create linked tables

To illustrate, below is a report (dashboard) we want to make:

  • On the left, we have transaction categories from our financial accounts
  • On the right, transaction details.

The data is from the transactions.csv data file. (You can download your bank statement if you want to follow along.)

The data on the left is categories. The data on the right are transactions.

To put this in terms of SQL, the data on the left is basically the result of:

select category, count(*) from transactions group by category

The data on the right is:

select * from transactions

We use the relationship wizard in Power BI to join them on the common element category. Then when we put two tables on the dashboard, Power BI uses this relationship to let us drill into the tables by category. In other words, we can see all our office expenses, advertising expenses, travel expenses, etc.
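To see the idea outside Power BI, here is a small sketch using Python's built-in sqlite3 module. The table layout and rows are made up for illustration and are not the article's transactions.csv. The first query plays the role of the category table, and the second shows what clicking a category does to the detail table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table transactions (category text, description text, amount real)")

# Made-up sample rows standing in for a bank statement.
conn.executemany(
    "insert into transactions values (?, ?, ?)",
    [
        ("office", "paper", -12.50),
        ("office", "toner", -40.00),
        ("travel", "train ticket", -89.00),
    ],
)

# Left-hand table: one row per category.
categories = conn.execute(
    "select category, count(*) from transactions group by category order by category"
).fetchall()

# Clicking a category filters the right-hand table to the matching rows.
selected = "office"
details = conn.execute(
    "select * from transactions where category = ? order by description", (selected,)
).fetchall()

print(categories)
print(details)
```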

This is what the report looks like when we publish it to powerbi.com:

This is what the transaction detail data looks like:

Group by category

Here are the categories. To make this view of the data we add the data source transactions.csv a second time, then drop all the columns except category. Then we pick group by category.

Now, pick the table visualization and the fields. For the category table we obviously just pick one field, category.

For the transactions table we pick all the transactions fields. Under fields we have the two data sources:

  • Categories
  • Transactions

Resizing the dashboard

Here is what the tables look like when put onto the dashboard. The table and table text are too small and not positioned in the right place when we start. So, grab the edges to move them around and then go to Page View/Actual Size to make them large enough to read.

Viewing relationships

Here is the relationship screen. We don’t have to do anything, as Power BI matches by the common element, category.

When designing the table, before we publish it to powerbi.com, we can test it. We cannot see the layout very well, meaning the full screen size or mobile layout.

But we can click on the category on the left. Then the table on the right updates to show only transactions in that selected category. You could call this synchronized tables.

Viewing full-size

As always, click Publish to Power BI to test the final version. And as we just said, it’s really the only way to see the full-sized screen as Power BI Desktop does not have a very good preview function.

Creating Table Visualizations in Power BI Dashboards
https://www.bmc.com/blogs/power-bi-linking-tables-relationships/
Thu, 18 Feb 2021 00:00:03 +0000

In this tutorial, we’ll show you how to create relationships between tables in Microsoft Power BI. The good news is that you do this with a point-and-click wizard, which eliminates the need to write any SQL commands.

Relationships in Power BI

In Power BI, a relationship documents the common elements between tables.

In the example we’ll use here, we have two tables from a sales system: customers and orders.

  • The common element is the customerNumber.
  • The customer table contains the customer name.
  • The orders table contains order amount, product sold, etc.

So, if you want to print the customer name on a sales report, for instance, you need to tell BI what columns link these two tables.

Sample data

To follow along with our example, download this data:

Import the two tables.

Relationship types

From the screen where you imported the data sources, click the Manage Relationships button. BI will guess what elements are in common. Since customerNumber is in both tables, it picks that. You can edit this if you need to change it.

The Edit button lets you edit the cardinality of the relationship. There are three types:

  • One-to-one. For each row in the left-hand table there is one and only one row in the right-hand table.
  • One-to-many. This is the most common scenario. It means for each row in one of the tables, there is more than one row in the other table. Which table is which depends on which you put on the left or the right. For our orders and customers tables, there are many (more than one) rows in the orders table for each row in the customers table.
  • Many-to-many. Think of this as a one-to-many relationship, but in both directions. For example, imagine a doctor who both prescribes and takes medicines. Many-to-many is not commonly used.

Click the third icon on the left and you get a visual view of the table relationships.

Now, go to the report layout screen and select the table type layout.

We see that the customers and orders tables are on the right-hand side. You pick the columns you want on the table from each. We pick the order information from the orders table, then the customer name from the customers table.

Because BI now understands how the two tables are related, it knows how to find the customer name given the customer number in the order table.
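Conceptually, the lookup works like the following Python sketch. The rows are made up, and the column names orderNumber and amount are assumptions. Only customerNumber and the customer name come from the description above.

```python
# Made-up sample rows; only customerNumber and the customer name are from the article.
customers = [
    {"customerNumber": 101, "customerName": "Acme Ltd"},
    {"customerNumber": 102, "customerName": "Globex"},
]
orders = [
    {"orderNumber": 1, "customerNumber": 101, "amount": 250.0},
    {"orderNumber": 2, "customerNumber": 101, "amount": 75.0},
    {"orderNumber": 3, "customerNumber": 102, "amount": 120.0},
]

# Index the "one" side of the relationship by the common element.
name_by_number = {c["customerNumber"]: c["customerName"] for c in customers}

# Each order row finds its customer name through customerNumber.
report = [
    (o["orderNumber"], name_by_number[o["customerNumber"]], o["amount"])
    for o in orders
]
for row in report:
    print(row)
```

Customer 101 appears on two order rows, which is the one-to-many case: one customer row, many order rows.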

The report created is a bit small. Click on Focus Mode on the report to zoom in to make it larger.

Here you can see that Power BI has retrieved the customer name column from the customer table. In other words, it used the relationship to look this up.

That concludes this tutorial.

How To Publish Power BI Reports
https://www.bmc.com/blogs/power-bi-publish-reports/
Thu, 11 Feb 2021 14:31:47 +0000

In this short tutorial, we’ll explain how—and why—to publish Microsoft Power BI Reports.

(Haven’t created a report yet? Learn how to create reports and pie charts.)

Publishing in Power BI

To publish a Power BI report means to push it to the cloud so you can share it with other users at powerbi.com.

Publishing in this software does not mean publishing it to a web page. But you can do that by loading the report into an iframe, an element that embeds one page inside another web page.

What you can do is controlled by your license:

  • Power BI is free to publish in your own workspace in Office 365 apps, like Dynamics CRM.
  • Power BI Pro, for $9.99 per month, lets you share reports with other Power BI Pro users.
  • Power BI Premium, for $4,995 per month, lets you share reports with anyone, including people with no Power BI license.

Why you need to publish Power BI reports

You use Power BI Desktop to create reports. That is done by a single person on a single computer. In order to share it with other people, and thus make it useful to someone else, you need to push it to the Power BI cloud.

Another reason (or, a kind of a limitation) is that you need to publish the report to see what it will look like to the end user, such as on:

  • Desktop
  • Tablet
  • Phone

How to publish a Power BI report

You simply push the Publish button. It’s as easy as that. Microsoft will prompt you to log in.

If you are using the free edition, you can only publish it to My workspace.

Here’s what a sample report looks like:

The reports you publish show up in your Office 365 Apps desktop:

That concludes this tutorial.

How To Create Reports in Microsoft Power BI
https://www.bmc.com/blogs/power-bi-create-reports/
Tue, 09 Feb 2021 15:01:22 +0000

In this article, we’ll show you how to create reports with Microsoft Power BI.

The approach of Microsoft Power BI is a bit different than other products—it’s not always clear. The basic procedure is to repeat the steps below for each widget (graphical object) that you want on the report.

You can follow along in this article without doing the tutorial yourself. If you do want to work with it, we’ll use the same banking data from .csv files that we used previously, so review that to have some data to work with.

Let’s get started.

How Power BI works: repeating pattern of steps

You can think of any report as a dashboard. A widget is some kind of visual display: a chart, a table, or just a single metric displayed in a text box.

The basic procedure is to repeat this process for each widget:

  1. Create a data source
  2. Run a transformation (Optional)
  3. Create a query or data model
  4. Pick a visualization
  5. Select fields
  6. Arrange visualization on dashboard

How to create Power BI reports

To illustrate, let’s move through each of these steps. First, we create a data source. That connects to a file, like a .csv, or a database.

Next, you have the option to run a transformation. In our example, we use financial data. We will:

  • Apply a filter to select only negative values (payments)
  • Drop and rename columns
  • Optionally apply a function, such as an aggregation

Step 3 is the natural result of step 2, because you have built up a query in stages.

Alternatively, at this point, you could create a data model. For example, if you have sales and inventory movements in two data sources you can model that. You would create a model to show the common element between tables: product number. (But in the example we’re using, we only have a single data source.)

In step 4, you create a visualization. In this example, we will have a table of transactions. A table is a row and column display. We will also have a single card (like a text box) to show a single number, the maximum transaction amount.

Next, we’ll pick the fields to show in the visualization from step 4. Finally, in the last step, we’ll position the visualizations on the dashboard.

Now, let’s walk through an actual example.

A hands-on tutorial

Here is the landing page for BI. By default, it shows a pie chart with no data. Notice the three icons on the left:

  • Dashboard
  • Queries
  • Data model

Adding the first visualization

The logical place to start is to select a data source.

The basic procedure is to load the data and optionally select Transform. In most cases, you would want to do a transformation.

For example, let’s click a column, then apply a filter to only have negative values (payments):

Here we select a numeric column, amount. Because it’s a number, we can run a math or aggregation function on it.

We select the Statistics function Maximum:

The result is a scalar (single value), as opposed to a row in a row-column table.
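In plain Python terms, the two transformation steps amount to a filter followed by an aggregation. The amounts below are made up for illustration.

```python
# Made-up amounts: negative values are payments, positive values are deposits.
amounts = [-120.00, 2500.00, -35.40, -980.15, 60.00]

# Step 1: the filter keeps only negative values (payments).
payments = [a for a in amounts if a < 0]

# Step 2: the Maximum statistic collapses the column to one scalar.
maximum = max(payments)

print(payments)
print(maximum)
```

Note that the maximum of a column of negative numbers is the value closest to zero, so here it is the smallest payment, not the largest.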

Now click on the new field and give it a meaningful name. Notice that BI keeps track of the steps we have taken.

You’ll also want to rename the query. At this point, BI calls the results of the transformation a query.

Click the close & apply button to close the Power BI editor and return to the dashboard view.

Select the card visualization, then select field maximum from the query maximum.

The card is added to the report:

Adding more visualizations

Now we can add another visualization to show how to build up your report.

We will make a table. Select recent sources and pick the same .csv file. Importantly, we have to go all the way back to the beginning data source because we turned the first source into a query. (We can’t use the query to make a table, since it’s already transformed into a scalar.)

Now we have two queries:

  • wf is a table
  • maximum is the data source for the card visualization

Here’s what the table looks like when attached to the dashboard:

The text looks annoyingly small and graphic-like. It’s not like a spreadsheet, which would be clear and easy to read. (We will show how to clean that up in an upcoming tutorial.)

Finally, move the card over to make room for the table. Select the corner so you can resize it.

That concludes this tutorial. Now, you can begin building your reports by repeating the widget pattern.

Top NumPy Statistical Functions & Distributions
https://www.bmc.com/blogs/numpy-statistical-functions/
Wed, 27 Jan 2021 15:17:01 +0000

NumPy supports many statistical distributions. This means it can generate samples from a wide variety of use cases. For example, NumPy can help to statistically predict:

  • The chances of rolling a 7 (i.e., winning) in a game of dice
  • How likely someone is to get run over by a car
  • How likely it is that your car will break down
  • How many people will be in line at the checkout counter

We explain by way of examples.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

Randomness & the real world

The NumPy functions don’t calculate probability. Instead they draw samples from the probability distribution of the statistic—resulting in a curve. The curve can be steep and narrow or wide or reach a small value quickly over time.

Its pattern varies by the type of statistic:

  • Normal
  • Weibull
  • Poisson
  • Binomial
  • Uniform
  • Etc.

Many phenomena in the real world are random. For example, if we toss out nearsightedness, clumsiness, and absentmindedness, then the chance that someone would get hit by a car is equal for all people.

Random quantities like these are often modeled with the normal distribution.

Note that the random() function in most programming languages draws from the uniform distribution, not the normal one. With the normal distribution, samples tend to hover about some middle point, known as the mean. The volatility of observations is called the variance. As the name suggests, if the observations vary a lot then the variance is large.

Let’s look at these distributions.

Normal

The arguments for the normal distribution are:

  • loc is the mean
  • scale is the square root of the variance, i.e. the standard deviation
  • size is the sample size or the number of trials. 400 means to generate 400 random numbers. We write (400,) but could have written 400. The tuple form shows that the output can have more than one dimension. Here we just want a flat run of numbers, not a grid or a cube.

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(loc=0,scale=1,size=(400,))
plt.plot(arr)

Notice in this that the numbers hover about the mean, 0:
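A quick numeric check makes the same point without a plot. This sketch uses NumPy's newer Generator interface with an arbitrary seed so the run is repeatable, which is a swap from the np.random.normal call above. The sample mean and standard deviation should land near loc and scale, not exactly on them.

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)
arr = rng.normal(loc=0, scale=1, size=(400,))

# With 400 samples, these should land near loc and scale.
print(arr.shape)
print(arr.mean(), arr.std())
```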

Weibull

Weibull is most often used in preventive maintenance applications. It’s basically the failure rate over time. In terms of machines like truck components this is called Time to Failure. Manufacturers publish these figures for planning purposes.

A Weibull distribution has a shape and scale parameter. Continuing with the truck example:

  • Shape is how quickly over time the component is likely to fail, or the steepness of the curve.
  • Scale stretches the curve along the time axis. NumPy’s weibull() does not take a scale parameter. Instead, you multiply the sampled values by the scale.

import numpy as np
import matplotlib.pyplot as plt
shape=5
arr = np.random.weibull(shape,400)
plt.hist(arr)

This histogram shows how many observations fall into each range of values, i.e., the frequency distribution:
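To apply a scale, multiply the samples. The sketch below assumes a made-up scale of 1000 hours and compares the sample mean with the theoretical Weibull mean, scale × Γ(1 + 1/shape). It uses the Generator interface with an arbitrary seed for repeatability.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability
shape = 5
scale = 1000  # hypothetical characteristic life, e.g. in hours

# weibull() draws with scale 1; multiplying applies the scale.
hours_to_failure = scale * rng.weibull(shape, 400)

# The theoretical mean of a Weibull distribution is scale * gamma(1 + 1/shape).
expected_mean = scale * math.gamma(1 + 1 / shape)
print(hours_to_failure.mean(), expected_mean)
```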

Poisson

The Poisson distribution gives the probability of a given number of events, such as people joining a line, over a period of time.

For example, the length of a queue in a supermarket is governed by the Poisson distribution. If you know that, then you can continue shopping until the line gets shorter and not wait around. That’s because the line length varies, and varies a lot, over time. It’s not the same length all day. So, go shopping or wander the store instead of waiting in the queue.

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.poisson(2,400)
plt.plot(arr)

Here we see the line length varies between 0 and 8. The function does not return a probability. Remember that it returns an observation, meaning it picks a number subject to the Poisson statistical curve.
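Although each draw is an observation, you can estimate probabilities from many draws by counting how often each line length turns up. A minimal sketch, using the Generator interface with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
arr = rng.poisson(2, 400)  # average of 2 people in line, 400 observations

# Observations are whole, non-negative counts that average out near 2.
print(arr.min(), arr.max(), arr.mean())

# Share of observations per line length: an empirical estimate of the probabilities.
lengths, counts = np.unique(arr, return_counts=True)
for length, count in zip(lengths, counts):
    print(length, count / arr.size)
```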

Binomial

The binomial distribution describes discrete outcomes, such as how many times you win in a series of dice rolls.

Let’s look at the game of craps. You roll two dice, and you win when you get a 7. You can get a 7 with these rolls:

  • 1,6
  • 2,5
  • 3,4
  • 4,3
  • 5,2
  • 6,1

So, there are six ways to win. There are 6*6=36 possibilities. So, the chance of winning is 6/36=⅙.
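The count is easy to confirm by listing every outcome. This sketch uses only the standard library:

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome of two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))
wins = [roll for roll in rolls if sum(roll) == 7]

chance = Fraction(len(wins), len(rolls))
print(len(wins), len(rolls), chance)
```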

To simulate 400 rounds of 36 rolls each, use:

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.binomial(36,1/6,400)
plt.hist(arr)

Each of the 400 values is the number of wins in 36 rolls. The histogram clusters around the expected value, 36 × ⅙ = 6.
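To see what binomial(36, 1/6, 400) represents, it helps to run the same experiment by actually rolling dice. In the sketch below (Generator interface, arbitrary seed), each of the 400 rounds consists of 36 attempts, and each value is the number of 7s in that round. Both averages should come out near 36 × 1/6 = 6.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# 400 rounds, each counting the wins in 36 attempts with a 1/6 chance.
wins_per_round = rng.binomial(36, 1 / 6, 400)

# The same experiment by rolling two dice 36 times per round.
dice = rng.integers(1, 7, size=(400, 36, 2))
sevens_per_round = (dice.sum(axis=2) == 7).sum(axis=1)

print(wins_per_round.mean(), sevens_per_round.mean())
```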

Uniform

The uniform distribution picks values with equal probability between a low and a high bound.

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.uniform(-1,0,1000)
plt.hist(arr)

Using the NumPy Bincount Statistical Function
https://www.bmc.com/blogs/numpy-bincount-function/
Wed, 20 Jan 2021 13:19:57 +0000

NumPy does a lot more than create arrays. This workhorse also does statistics and functions, such as correlation, which are important for scientific computing and machine learning.

We start our survey of NumPy statistical functions with bincount().

The bincount function

In NumPy, the bincount function counts how many times each non-negative integer value occurs in an array.

First we make an array with:

  • Three 1s
  • Two 2s
  • One 3
  • Five 4s
  • One 5

arr = np.array([1,1,1,2,2,3,4,4,4,4,4,5])

Results in:

array([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

Then we use the NumPy bincount() function to count how often each value occurs.

d=np.bincount(arr)

Results in an array of counts by index position. In other words, the count at index i is the number of times the value i occurs.

Note the 0 in front. bincount always starts at the value 0, and our array contains no 0s, so the result has one more bin than the largest value in the array. So, we will make some adjustments for that.

array([0, 3, 2, 1, 5, 1])
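A short check shows how the result lines up with the values. The length of the result is the largest value plus one, and each index is the value being counted:

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
d = np.bincount(arr)

# One bin per integer from 0 up to the largest value in arr.
print(len(d), arr.max() + 1)

# Index i holds the number of times the value i occurs.
for value, count in enumerate(d):
    print(value, count)
```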

We make an array with unique elements from arr. We do this so we can plot the count against the values later.

a=np.unique(arr)

Results in:

array([1, 2, 3, 4, 5])

Because bincount includes a bin for the value 0, we insert a 0 at the beginning of arr so that the unique values and the bincount are the same shape and we can plot them.

b=np.insert(arr,0,[0])

This gives us:

array([0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

Then we make a unique list out of that:

c=np.unique(b)

Now it has the extra 0 to line up with the bincount:

array([0, 1, 2, 3, 4, 5])

Now c and d are the same shape, so we can plot them using Matplotlib.

plt.bar(c,d)

Results in this chart:

As you can see, there are:

  • Five elements with value 4
  • One element with value 3

The complete code

Here is the complete code.

import numpy as np
import matplotlib.pyplot as plt
arr = np.array([1,1,1,2,2,3,4,4,4,4,4,5])
d=np.bincount(arr)
a=np.unique(arr)
b=np.insert(arr,0,[0])
c=np.unique(b)
plt.bar(c,d)
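As a possible shortcut, the x positions can be generated directly from the length of the bincount result, or np.unique can return the values and counts in one call. This is an alternative sketch, not the approach used above, and it leaves out the plotting step:

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
d = np.bincount(arr)

# Option 1: the x positions are simply the bin indices 0..max.
c = np.arange(len(d))

# Option 2: np.unique returns the values and their counts together,
# which skips the empty bin for 0.
values, counts = np.unique(arr, return_counts=True)

print(c, d)
print(values, counts)
```

Either pair, (c, d) or (values, counts), has matching shapes and can be passed to plt.bar.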

Using StringIO to Read Delimited Text Files into NumPy
https://www.bmc.com/blogs/numpy-text-files-stringio/
Tue, 12 Jan 2021 14:06:09 +0000

In this tutorial, we’ll show you how to read delimited text data into a NumPy array using the StringIO class from Python’s io module.

Data we used

We will read this crime data:

,crime$cluster,Murder,Assault,UrbanPop,Rape
Alabama,4,13.2,236,58,21.2
Alaska,4,10,263,48,44.5
Arizona,4,8.1,294,80,31
Arkansas,3,8.8,190,50,19.5
California,4,9,276,91,40.6
Colorado,3,7.9,204,78,38.7
Connecticut,2,3.3,110,77,11.1
Delaware,4,5.9,238,72,15.8
Florida,4,15.4,335,80,31.9

Parameters

In the code below, we download the data using urllib. Then we use np.genfromtxt to import it to the NumPy array. Note the following parameters:

delimiter="," The delimiter between columns.
skip_header=1 We skip the first line since it has column headers and not data.
dtype=dtypes This parameter is a list of (name, dtype) tuples. NumPy assigns each column the given name and converts it to the given dtype (data type).

If we don’t want to assign names we would use (dtype1, dtype2, …).

Note that we use the type float. Since NumPy is built using the C language, you can also use its fixed-size C-style types, such as 32-bit integers.

We use S12 for the string column rather than str, as str converts this data to "". You could also use Unicode, U12.

We also could have written np.string_ and np.unicode_ but that does not give any length, so it means a null terminated byte, which is not a string. So, it would return a blank space.

We could have used object as well.

Note that NumPy uses these names:

dtype=[('crime', 'S12'), ('cluster', '<f8'), ('Murder', '<f8'), ('Assault', '<f8'), ('UrbanPop', '<f8'), ('Rape', '<f8')]

In '<f8', the < sign gives the byte order (< is little-endian, > is big-endian) and f8 means an 8-byte float.
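As a quick check of the dtype codes above, here is a minimal sketch that asks NumPy how many bytes each one occupies. The variable names are just for illustration.

```python
import numpy as np

# 'S12' is a byte string of up to 12 bytes; 'U12' is Unicode at 4 bytes per character.
bytes_dtype = np.dtype("S12")
text_dtype = np.dtype("U12")
print(bytes_dtype.itemsize)  # 12
print(text_dtype.itemsize)   # 48

# A bare str carries no length, so the field has room for zero characters.
bare_str = np.dtype(str)
print(bare_str.itemsize)     # 0

# Python's float maps to an 8-byte float; '>f' is a big-endian 4-byte float.
float_dtype = np.dtype(float)
big_endian_float = np.dtype(">f")
print(float_dtype.itemsize, big_endian_float.itemsize)  # 8 4
```

The zero-length result for str is why the tutorial spells out a length such as S12 or U12.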

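usecols=(1,5) We did not use this parameter. If we had used it, NumPy would have read only columns 1 and 5, counting from zero (cluster and Rape). To skip just the first column, list every column you want to keep, for example usecols=(1,2,3,4,5).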
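To see what usecols does, here is a small sketch that runs without a download. It assumes two rows copied from the crime data into a string, with the header left out.

```python
from io import StringIO
import numpy as np

# Two rows in the same layout as the crime data (header omitted).
sample = "Alabama,4,13.2,236,58,21.2\nAlaska,4,10,263,48,44.5"

# usecols=(1, 5) keeps only columns 1 and 5 (cluster and Rape), counting from zero.
two_cols = np.genfromtxt(StringIO(sample), delimiter=",", usecols=(1, 5))
print(two_cols)

# Listing columns 1 through 5 skips only the state name in column 0.
numeric_cols = np.genfromtxt(StringIO(sample), delimiter=",",
                             usecols=(1, 2, 3, 4, 5))
print(numeric_cols.shape)  # (2, 5)
```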

The code explained

Here is the code:

import urllib.request
import numpy as np
from io import StringIO

url = "https://raw.githubusercontent.com/werowe/MLexamples/master/crime_data.csv"

# Download the file and decode it line by line into one string.
file = urllib.request.urlopen(url)
data = ""
for d in file:
    data = data + d.decode('utf-8')

# One (name, dtype) pair per column.
dtypes = [('crime', "S12"),
          ('cluster', float),
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('Rape', float)]

# StringIO lets genfromtxt read the string as if it were a file.
arr = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1,
                    dtype=dtypes)

Results in:

array([(b'Alabama', 4., 13.2, 236., 58., 21.2),
(b'Alaska', 4., 10. , 263., 48., 44.5),

Note that NumPy returned byte strings (shown with the b prefix) for the string column. If we want regular strings, we can use Unicode:

dtypes=[('crime','U25'),
('cluster', '>f'),
('Murder' ,float),
('Assault',float),
('UrbanPop',float),
('Rape',float)]

Results in:

array([('Alabama', 4., 13.2, 236., 58., 21.2),
('Alaska', 4., 10. , 263., 48., 44.5),
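The byte-string versus Unicode difference can also be tried without a network connection. This sketch assumes a three-row, three-column excerpt of the data held in a string; the names bytes_dtypes and text_dtypes are made up for the example.

```python
from io import StringIO
import numpy as np

# A small excerpt of the crime data: state, cluster, Murder.
sample = "Alabama,4,13.2\nAlaska,4,10\nArizona,4,8.1"

bytes_dtypes = [("crime", "S12"), ("cluster", float), ("Murder", float)]
text_dtypes = [("crime", "U12"), ("cluster", float), ("Murder", float)]

as_bytes = np.genfromtxt(StringIO(sample), delimiter=",", dtype=bytes_dtypes)
as_text = np.genfromtxt(StringIO(sample), delimiter=",", dtype=text_dtypes)

# The S12 column holds bytes, the U12 column holds ordinary strings.
print(as_bytes["crime"][0])
print(as_text["crime"][0])
```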

If we leave off dtypes and let NumPy pick the data types, it assigns NaN (missing data) to the string column, because the state names cannot be converted to numbers. It also uses float as the default for all numeric values.

arr=np.genfromtxt(StringIO(data), delimiter=",", skip_header=1)

Results in:

array([[  nan,   4. ,  13.2, 236. ,  58. ,  21.2],
[  nan,   4. ,  10. , 263. ,  48. ,  44.5],
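If you would rather have NumPy work out a type for each column, genfromtxt also accepts dtype=None. The sketch below is not from the original code: it assumes a small sample with a clean header row, and passes names=True and an explicit encoding so the text column comes back as ordinary strings.

```python
from io import StringIO
import numpy as np

# Assumed sample: a header row followed by three data rows.
sample = "state,cluster,Murder\nAlabama,4,13.2\nAlaska,4,10\nArizona,4,8.1"

# dtype=None asks NumPy to infer a type per column;
# names=True takes the field names from the first line.
guessed = np.genfromtxt(StringIO(sample), delimiter=",", dtype=None,
                        names=True, encoding="utf-8")

print(guessed.dtype.names)
print(guessed["state"])
print(guessed["Murder"])
```

Here the state column stays text, cluster becomes an integer column, and Murder becomes a float column.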

Having assigned names to the columns with dtypes, we can refer to a column by name instead of by index:

arr['Murder']
array([13.2, 10. ,  8.1,  8.8,  9. ,  7.9,  3.3,  5.9, 15.4, 17.4,  5.3,
2.6, 10.4,  7.2,  2.2,  6. ,  9.7, 15.4,  2.1, 11.3,  4.4, 12.1,
2.7, 16.1,  9. ,  6. ,  4.3, 12.2,  2.1,  7.4, 11.4, 11.1, 13. ,
0.8,  7.3,  6.6,  4.9,  6.3,  3.4, 14.4,  3.8, 13.2, 12.7,  3.2,
2.2,  8.5,  4. ,  5.7,  2.6,  6.8])
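A named column is an ordinary array, so it can also be used to filter rows. This is a minimal sketch on an assumed four-row excerpt of the data; the threshold of 10 is arbitrary.

```python
from io import StringIO
import numpy as np

sample = "Alabama,4,13.2\nAlaska,4,10\nArizona,4,8.1\nConnecticut,2,3.3"
dtypes = [("crime", "U12"), ("cluster", float), ("Murder", float)]
arr = np.genfromtxt(StringIO(sample), delimiter=",", dtype=dtypes)

# Selecting a field by name returns that column as a plain array...
murder = arr["Murder"]

# ...which can drive a boolean mask over the rows.
high = arr[murder >= 10]
print(high["crime"])
```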

Missing values

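We can tell NumPy to plug in a value, like -1, wherever a value is missing. The missing_values parameter lists the strings that count as missing (an empty field counts by default), and filling_values gives the value to use in their place. By default, missing floats become np.nan and missing ints become -1.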

For example, the Murder value is missing from the Arizona row here:

Alaska,4,10,263,48,44.5
Arizona,4,,294,80,31
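Here is a small sketch of how a missing field is filled. It assumes only the numeric columns of two rows, with one field left empty, and compares the default fill against filling_values=-1.

```python
from io import StringIO
import numpy as np

# Second row has an empty field in the middle column.
sample = "4,10,263\n4,,294"

# With no extra parameters, the empty float field becomes nan.
default_fill = np.genfromtxt(StringIO(sample), delimiter=",")
print(default_fill)

# filling_values supplies the replacement for missing fields.
custom_fill = np.genfromtxt(StringIO(sample), delimiter=",", filling_values=-1)
print(custom_fill)
```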

That concludes this tutorial.

NumPy Introduction with Examples https://www.bmc.com/blogs/numpy-introduction/ Thu, 07 Jan 2021 12:11:27 +0000

If we study Pandas, we have to study NumPy, because Pandas is built on NumPy. Here, I’ll introduce NumPy and share some basic functions.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

What is NumPy?

NumPy is a package for creating and working with arrays. It lets you make arrays of numbers with different precision and scale, plus strings, so it is especially useful for scientific computing.

Python by itself only has floats, integers, and complex numbers. But NumPy expands what Python can do because it handles:

  • 32-bit numbers
  • 16-bit numbers
  • Signed numbers
  • Unsigned numbers
  • And more
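As a rough illustration of these number types, the sketch below builds arrays with explicit dtypes and asks NumPy for their sizes and ranges. The variable names are made up for the example.

```python
import numpy as np

# A 16-bit signed integer array uses 2 bytes per element.
small = np.array([1, 2, 3], dtype=np.int16)
print(small.dtype, small.itemsize)  # int16 2

# iinfo reports the range a given integer type can hold.
int16_info = np.iinfo(np.int16)
uint8_info = np.iinfo(np.uint8)
print(int16_info.min, int16_info.max)  # -32768 32767
print(uint8_info.min, uint8_info.max)  # 0 255

# A 32-bit float array uses 4 bytes per element.
single = np.array([1.5, 2.5], dtype=np.float32)
print(single.dtype, single.itemsize)  # float32 4
```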

But that’s not the only reason to use NumPy. It’s designed for efficiency and scale, making it the workhorse for large machine learning (ML) libraries like TensorFlow.


Now, let’s take a look at some basic functions of NumPy arrays.

Creating a NumPy array

Create an array with np.array(<array>).

Don’t write np.array(1,2,3,4,5), as 1,2,3,4,5 is not an array. NumPy would interpret the items after the first comma as additional parameters to the array() function and raise an error.

This creates an array:

import numpy as np
arr = np.array([1,2,3])
arr

Results:

array([1, 2, 3])
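The difference between passing a list and passing loose numbers can be checked directly. In this sketch the exact exception type is treated as an assumption, since it has varied between NumPy versions, so both likely types are caught.

```python
import numpy as np

# Correct: one list argument.
from_list = np.array([1, 2, 3, 4, 5])
print(from_list.shape)  # (5,)

# Incorrect: five separate arguments. NumPy rejects the call;
# the exception type depends on the NumPy version.
try:
    np.array(1, 2, 3, 4, 5)
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print(type(err).__name__)
```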

Array shape

An array has a shape, for example 2×2 or 2×1.

Query the shape like this:

arr.shape

Results:

(3,)

This is best thought of as a vector rather than a 3×1 array, because it has only one dimension. The blank after the comma is not a dimension.

By contrast, this array is 3×1, since it is an array of 3 arrays of one element each.

arr = np.array([[1],[2],[3]])
arr.shape

Results:

(3, 1)
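The two cases above can be compared side by side with ndim, which reports the number of dimensions. A minimal sketch:

```python
import numpy as np

vector = np.array([1, 2, 3])
column = np.array([[1], [2], [3]])

# The vector has one dimension, the column has two.
print(vector.ndim, vector.shape)  # 1 (3,)
print(column.ndim, column.shape)  # 2 (3, 1)

# Both hold the same number of elements.
print(vector.size, column.size)   # 3 3
```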

Reshaping an array

You can reshape an array into any shape whose dimensions multiply to the same number of elements. This array of shape (6,) can be reshaped to 2×3 since 2*3=6.

import numpy as np
arr = np.array([1,2,3,4,5,6]).reshape(2,3)
print(arr)

Results:

[[1 2 3]
 [4 5 6]]
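Two further cases are worth trying: letting NumPy work out one dimension with -1, and asking for a shape that does not fit. This is a small sketch with made-up variable names.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# -1 tells NumPy to infer that dimension from the element count.
inferred = arr.reshape(3, -1)
print(inferred.shape)  # (3, 2)

# 4 x 2 needs 8 elements, so reshaping 6 elements fails.
try:
    arr.reshape(4, 2)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```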

Arange

Notice that this function is not arrange but arange, as in array range. Use it to fill an array with numbers. (There are lots of ways to do that, a topic that we will cover in a subsequent post.)

import numpy as np
arr = np.arange(5)
arr

Results in:

array([0, 1, 2, 3, 4])
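Like Python's range, arange also accepts a start, a stop, and a step. A brief sketch:

```python
import numpy as np

# Start at 2, stop before 10, step by 3.
stepped = np.arange(2, 10, 3)
print(stepped)  # [2 5 8]

# A negative step counts down; the stop value is still excluded.
countdown = np.arange(5, 0, -1)
print(countdown)  # [5 4 3 2 1]
```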

Slice

Slicing an array is a difficult topic that becomes easier with practice. Here are some simple examples.

Take this array.

arr = np.array([1,2,3,4,5,6]).reshape(2,3)
arr

Which looks like this:

array([[1, 2, 3],
[4, 5, 6]])

(While you could say this has 2 rows and 3 columns to make it easier to understand, that’s not technically correct. When you have more than two dimensions, the concept of rows and columns goes away. So that’s why it’s better to say dimensions and axes.)

This slice operation means start at the second position of the first axis and go to the end:

arr[1:]

Results in:

array([[4, 5, 6]])

This starts at the beginning and goes to the end:

arr[0:]

Results in:

array([[1, 2, 3],
[4, 5, 6]])

Add a comma to specify which column:

arr[:,1]

Results in:

array([2, 5])

Select along the other axis like this:

arr[1,:]

Results in:

array([4, 5, 6])

Select a single element.

arr[1,0]

Results:

4
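Slices on both axes can be combined to pull out a block of the array. The sketch below uses the same 2×3 array; the variable names are only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6]).reshape(2, 3)

# Every position on the first axis, positions 1 onward on the second.
right_block = arr[:, 1:]
print(right_block)  # [[2 3]
                    #  [5 6]]

# First position on the first axis, first two on the second.
first_two = arr[0, :2]
print(first_two)  # [1 2]

# Negative indexes count from the end.
last_element = arr[-1, -1]
print(last_element)  # 6
```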

Step

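A third number in the slice sets the step. This starts at the second position, stops before position 6, and takes every second element:

arr = np.array([1,2,3,4,5,6])
arr[1:6:2]

Results in:

array([2, 4, 6])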
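The start and stop can be left out when using a step, and the step can be negative. A short sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Every second element from the start.
every_other = arr[::2]
print(every_other)  # [1 3 5]

# A step of -1 walks the array backwards.
reversed_arr = arr[::-1]
print(reversed_arr)  # [6 5 4 3 2 1]
```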

That concludes this introduction.
