Logstash 101: Using Logstash in a Data Processing Pipeline (https://s7280.pcdn.co/logstash-using-data-pipeline/)

Data is a core part of any modern application. It is the catalyst for most technological functions and use-cases, from server, infrastructure, and application troubleshooting all the way to analyzing user behavior patterns and preferences and building complex machine learning models.

We need to collect data from various sources to achieve all these things. The ELK stack is one of the leading solutions when it comes to analyzing application or server data. It is a collection of three open-source products:

  • Elasticsearch
  • Logstash
  • Kibana

Elasticsearch, based on the Lucene engine, is the storage and analytical backbone of the ELK stack. Kibana lets you visualize the data on Elasticsearch in any shape or form to gain better and easily understandable insights from data. Logstash is the ingest engine and the starting point of the ELK, which aggregates data from multiple services, files, logs, etc., and pushes it to Elasticsearch for further analysis.

In this article, we will focus on Logstash and how it can be used in a data processing pipeline.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

What is Logstash?

Logstash is a free and open-source, server-side data processing pipeline that can be used to ingest data from multiple sources, transform it, and then send it to further processing or storage.

While Logstash is an integral part of the ELK stack, it does not mean Logstash is limited to use with those tools. It can push data not only to Elasticsearch but also to other services like Zabbix, Datadog, InfluxDB, MongoDB, and even to message brokers like RabbitMQ, Amazon SQS, etc.

Logstash consists of three core sections—inputs, filters, and outputs—as shown here:

(Figure: the Logstash pipeline showing inputs, filters, and outputs)

Logstash inputs

The primary feature of Logstash is its ability to collect and aggregate data from multiple sources. With over 50 plugins that can be used to gather data from various platforms and services, Logstash can cater to a wide variety of data collection needs from a single service. These inputs range from common ones like file, beats, syslog, stdin, UDP, TCP, HTTP, and heartbeat to specific services such as Azure Event Hubs, Apache Kafka, AWS Kinesis, Salesforce, and SQLite.

Inputs are the starting point of Logstash configuration. By default, Logstash will automatically create a stdin input if there are no inputs defined. One of the most important considerations is that users will need to properly configure, categorize, and tag the inputs so that the data can be properly processed (filtered) and sent to its required output destination.
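
As a minimal sketch of that kind of labeling (the Beats port 5044 is conventional; the file path and tag names are only illustrative), an input section might look like this:

input {
  beats {
    port => 5044
    type => "nginx"                  # label events from this input
    tags => ["web"]
  }
  file {
    path => "/var/log/myapp/*.log"   # hypothetical application log path
    tags => ["app"]
  }
}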

Logstash filters

What distinguishes Logstash from most other services is its ability to apply filters to the input data and process it. Rather than acting as a simple aggregator that forwards data as-is, Logstash extracts information from raw data and transforms it into more meaningful, common formats as an intermediary step before sending it on for further processing or storage.

This process not only reduces the workload of further processing services like Elasticsearch; it also provides a common format that can be easily processed for better analysis of the gathered data. Logstash offers many filter plugins, from a simple csv plugin for parsing CSV data, to grok, which parses unstructured data into named fields, to urldecode and others.
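
For example, a small filter block along these lines (the grok pattern shown is the stock combined access log pattern; the renamed and removed field names are only illustrative) parses a raw web server log line into fields and then tidies them up:

filter {
  grok {
    # parse a combined-format access log line into named fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  mutate {
    # rename and drop fields to produce a cleaner, common format
    rename       => { "clientip" => "client_ip" }
    remove_field => ["agent"]
  }
}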

Logstash outputs

The final part of Logstash is its output. As mentioned earlier, Logstash can send the collected (input) and processed (filter) data to a variety of destinations: Elasticsearch itself, simple files, storage services like S3, messaging services like SQS and Kafka, and other services like AWS CloudWatch and Google BigQuery.

These plugins make Logstash one of the most versatile solutions for gathering data.

Outputs are not limited to a single destination. By default, Logstash creates a stdout output if no outputs are configured. However, Logstash can also be configured to push data to multiple destinations, routing specific inputs to specific outputs.
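
A sketch of such conditional routing, assuming events were tagged at the input stage as in the earlier example (the hosts, index name, and file path are only illustrative), could look like this:

output {
  if "web" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]          # assumed local Elasticsearch
      index => "weblogs-%{+YYYY.MM.dd}"
    }
  } else {
    file {
      path => "/tmp/other-events.log"      # hypothetical catch-all destination
    }
  }
}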

Configuring Logstash

Now that we understand the architecture of Logstash and what it can do, let’s look at how to configure Logstash. The primary requirement for Logstash is Java (JVM). Most Logstash architecture-specific installation bundles come with the necessary Java version bundled while allowing users to change the Java version as required.

Installing Logstash

Logstash can be installed on all major operating systems, including Windows, Linux, and macOS. Additionally, it provides support for Docker images as well as Helm Charts for direct Kubernetes deployments.

Let’s start with the following Docker Logstash example. Creating the Logstash container is a relatively straightforward process. Simply pull the Logstash image and then create the container with the Logstash configuration attached.

# Pull Image
docker pull docker.elastic.co/logstash/logstash:<version>
# Create Container - Config Folder
docker run --rm -it -v ~/config/:/usr/share/logstash/config/ docker.elastic.co/logstash/logstash:<version>
# Create Container - File Attached
docker run --rm -it -v ~/config/logstash.yml:/usr/share/logstash/config/logstash.yml docker.elastic.co/logstash/logstash:<version>
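
Pipeline definitions (the input/filter/output files discussed below) can be mounted the same way; in the official image the default pipeline folder is /usr/share/logstash/pipeline/:

# Create Container - Pipeline Folder
docker run --rm -it -v ~/pipeline/:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:<version>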

Logstash configuration structure

A Logstash configuration file consists of input, filter, and output sections, in that order. A single Logstash instance can have multiple configuration files, each collecting data from different services, filtering it, and pushing it to the desired destination. Typically, these configurations live in the /etc/logstash/conf.d/ folder, in files ending with the .conf extension.

The example below shows a simple Logstash configuration that captures data from a file and outputs it to another file, without any filtering.

read-log.conf

input {
  file {
    path => "/tmp/*.logs"
    start_position => "beginning"
    codec => json
  }
}
output {
  file {
    path => "/home/user/logstash_out.log"
  }
}

If we look at a more comprehensive configuration file where we need to push Nginx logs to Elasticsearch, we can define it as follows:

nginx-logs.conf

input {
  file {
    type => "nginx"
    path => "/var/log/nginx/*"
    exclude => "*.gz"
  }
}
filter {
  if [type] == "nginx" {
    grok {
      patterns_dir => "/etc/logstash/patterns"
      match => { "message" => "%{NGINX_ACCESS}" }
      remove_tag => ["_grokparsefailure"]
      add_tag => ["nginx_access"]
    }
    geoip {
      source => "clientip"
    }
  }
}
output {
  elasticsearch {
    hosts => ["www.test.host:9200"]
  }
}

Logstash alternatives

Logstash is powerful and versatile, yet it is not the simplest or the only solution in the market.

Due to its feature set, there will be a comparatively higher learning curve as well as higher configuration and resource requirements. In this section, we will explore some alternatives to Logstash that can act as the starting point of a data processing pipeline to ingest data.

Filebeat

Filebeat is a lightweight log shipper from the creators of Elastic stack. As a part of the beats family, Filebeat specializes in collecting data from specified files or logs. This, in turn, leads to Filebeat being less resource intensive than Logstash while providing the ability to collect and push data.

Filebeat is an excellent choice when users need simple data ingestion functionality. It can even be used as an input for Logstash. However, the major downside of Filebeat compared to Logstash is its limited functionality: Filebeat only pushes the logs it collects, and its outputs are limited to Logstash, Elasticsearch, Kafka, and Redis.

Logagent

Logagent from Sematext is another open-source, cloud-native, lightweight data shipper that is a direct competitor of Logstash. It provides the ability to filter the data collected via Logagent and supports multiple input and output options. These include syslog, files, Azure Event Hubs, webhooks, and even MySQL and MSSQL queries as inputs, and ElasticSearch, Amazon ElasticSearch, Prometheus, etc. as outputs. Logagent also has a relatively gentle learning curve compared to Logstash.

However, as the more mature platform, Logstash offers more options when it comes to input, filter, and output and provides more flexibility to support different kinds of data ingestion and processing needs.

Fluentd

This open-source data collector is aimed at providing a unified logging layer. Fluentd is a reliable and extensible data collector built with efficiency in mind. Its major feature is that it tries to structure data as JSON as much as possible, which allows a much easier data processing experience for downstream (output) services. Fluentd is now a CNCF project and provides integrations with over five hundred platforms, services, and technologies.

Fluentd is an excellent choice if you want to parse data into a structured format. However, it requires workarounds such as filtering via regular expressions, tags, etc. when users need to collect and push unstructured data. Furthermore, Fluentd can be a complex solution for beginners, with a steeper learning curve.

rsyslog

rsyslog, the "rocket-fast system for log processing," is a log processor aimed at providing the highest performance. It can leverage modern hardware more efficiently with full multi-threading support. It is valuable when parsing data with multiple rules, as rsyslog can offer the same level of performance regardless of the number of rules. rsyslog also supports a multitude of input and output options, with the ability to filter any part of a syslog message.

The primary disadvantage of rsyslog is its complexity. For instance, correctly setting up rsyslog and parsing data with it is more involved than with the other solutions discussed here. However, if you can overcome this configuration hurdle, rsyslog offers a stable experience for most use cases.

Logstash is user-friendly, feature-rich

Logstash is one of the most user-friendly and feature-rich data collection and processing tools. As part of the ELK stack, Logstash has industry-wide recognition and adoption for collecting, aggregating, filtering, and outputting data, allowing users to build robust data processing pipelines.

Moreover, Logstash is one of the best options for ingesting data from various inputs and parsing and transforming it before sending it to storage or further analytics.

ElasticSearch Commands Cheat Sheet (https://www.bmc.com/blogs/elasticsearch-commands/)

Here we show some of the most common ElasticSearch commands using curl. ElasticSearch is sometimes complicated. So here we make it simple.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

delete index

Below the index is named samples.

curl -X DELETE 'http://localhost:9200/samples'

list all indexes

curl -X GET 'http://localhost:9200/_cat/indices?v'

list all docs in index

curl -X GET 'http://localhost:9200/samples/_search'

query using URL parameters

Here we use Lucene query format to write q=school:Harvard.

curl -X GET http://localhost:9200/samples/_search?q=school:Harvard

Query with JSON aka Elasticsearch Query DSL

You can query using parameters on the URL. But you can also use JSON, as shown in the next example. JSON is easier to read and debug than one giant string of URL parameters when you have a complex query.

curl -XGET --header 'Content-Type: application/json' http://localhost:9200/samples/_search -d '{
      "query" : {
        "match" : { "school": "Harvard" }
    }
}'

list index mapping

Elasticsearch indexes every field by default. So this lists all fields and their types in an index.

curl -X GET http://localhost:9200/samples
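
To see only the field mappings, without the index settings, you can also query the _mapping endpoint directly:

curl -X GET 'http://localhost:9200/samples/_mapping?pretty'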

Add Data

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/1 -d '{
   "school" : "Harvard"			
}'

Update Doc

Here is how to add fields to an existing document. First we create a new one. Then we update it.

 

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2 -d '
{
    "school": "Clemson"
}'

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2/_update -d '{
"doc" : {
               "students": 50000}
}'
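
You can then fetch the document to confirm that the new field was added:

curl -X GET 'http://localhost:9200/samples/_doc/2?pretty'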

 

backup index

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/_reindex -d '{
  "source": {
    "index": "samples"
  },
  "dest": {
    "index": "samples_backup"
  }
}'

 

Bulk load data in JSON format

export pwd="elastic:"

curl --user $pwd  -H 'Content-Type: application/x-ndjson' -XPOST 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/0/_bulk?pretty' --data-binary @<file>
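
The file passed with --data-binary must be newline-delimited JSON (ndjson): each document is preceded by an action line, and the file must end with a newline. A minimal example (the index and field names here are only illustrative):

{ "index" : { "_index" : "samples" } }
{ "school" : "Harvard", "students" : 50000 }
{ "index" : { "_index" : "samples" } }
{ "school" : "Clemson", "students" : 50000 }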

Show cluster health

curl --user $pwd  -H 'Content-Type: application/json' -XGET https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/_cluster/health?pretty

Aggregation and Bucket Aggregation

For an nginx web server this produces web hit counts by user city:

curl -XGET --user $pwd --header 'Content-Type: application/json'  https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
        "aggs": {
                "cityName": {
                        "terms": {
                                "field": "geoip.city_name.keyword",
                                "size": 50
                        }
                }
        }
}'

This expands that to produce response code counts by city in an nginx web server log:

curl -XGET --user $pwd --header 'Content-Type: application/json'  https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
        "aggs": {
                "city": {
                        "terms": {
                                "field": "geoip.city_name.keyword"
                        },
                        "aggs": {
                                "responses": {
                                        "terms": {
                                                "field": "response"
                                        }
                                }
                        }
                },
                "responses": {
                        "terms": {
                                "field": "response"
                        }
                }
        }
}'

Using ElasticSearch with Basic Authentication

If you have turned on security with ElasticSearch then you need to supply the user and password like shown below to every curl command:

curl -X GET 'http://localhost:9200/_cat/indices?v' -u elastic:(password)

Pretty Print

Add ?pretty=true to any search to pretty print the JSON. Like this:

curl -X GET 'http://localhost:9200/(index)/_search?pretty=true'

To query and return only certain fields

To return only certain fields put them into the _source array:

GET filebeat-7.6.2-2020.05.05-000001/_search
 {
    "_source": ["suricata.eve.timestamp","source.geo.region_name","event.created"],
    "query":      {
        "match" : { "source.geo.country_iso_code": "GR" }
    }
}

To Query by Date

When the field is of type date you can use date math, like this:

GET filebeat-7.6.2-2020.05.05-000001/_search
{
    "query": {
        "range" : {
            "event.created": {
                "gte" : "now-7d/d"
            }
        }
    }
}
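
The same range query can be sent with curl instead of the Kibana console (add -u user:password if security is enabled):

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/filebeat-7.6.2-2020.05.05-000001/_search?pretty' -d '{
    "query": {
        "range" : {
            "event.created": {
                "gte" : "now-7d/d"
            }
        }
    }
}'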

How To Use Elastic Enterprise Search with GitHub (https://www.bmc.com/blogs/elastic-enterprise-search-github/)

Elastic Company has acquired Swiftype for its product portfolio, branding it Elastic Enterprise Search. This product gives users the ability to query a variety of data sources, including public sources and internal company documents and data sources.

We previously explained how to install Enterprise Search. In this article, I’ll illustrate how it works by connecting it to GitHub.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

Overview: How Elastic Enterprise Search works

Enterprise Search offers the ability for users to query data sources using natural language. It is particularly useful within organizations that share internal documents. Popular sources you can query with Enterprise Search include:

  • Dropbox
  • Google Docs
  • GitHub
  • Microsoft OneDrive
  • Jira
  • Salesforce
  • Custom sources (via APIs)

Enterprise Search works by indexing search data in ElasticSearch and connecting to the data source using OAuth, an industry standard for authenticating apps. To understand OAuth, I liken it to using your Facebook or Google credentials to log into an app.

Note on GitHub limitations

You cannot use Enterprise Search with your own personal GitHub repository. Instead you must use an organizational repository. In other words, if you are an employee named Fred working at Smith Airlines, then you can search Smith Airlines. You cannot search Fred. That makes sense since Enterprise Search is designed for an enterprise and not a single individual.

Setting up Elastic Enterprise Search

Follow these steps to set up Elastic Enterprise Search.

  1. Create an OAuth App in GitHub. This is where you define the callback URL that points to your Enterprise Search installation. It also creates the Client ID and Client Secret needed to connect to Enterprise Search.
  2. Create the GitHub source in Enterprise Search.
  3. Enterprise Search polls GitHub for activity.
  4. Start searching.

Configuring GitHub OAuth Settings

Log in to GitHub and click on Settings, then Developer settings for the repository. Make sure you use the organizational repository settings and not your personal settings.

In this example the repository is walkerrowe:

Go to Developer settings then create a New OAuth App.

Give it a name. For the callback URL, use these links:

Homepage URL https://(your server):3002
Authorization callback URL http://(your server):3002/ent/

Note: the Swiftype documentation mentions localhost. Do not use that. (GitHub cannot reach your localhost.) Instead, it must be the public IP address of your Enterprise Search server or the private IP if you are running GitHub internally. You will need to open firewall port 3002.

Click Register Application then note the client ID and client secret. You will put those credentials into Enterprise Search.

Add GitHub Source in Elastic Enterprise Search
Click on Add a Source.

Select GitHub.

Then follow the screens. If you are already logged into GitHub, it will try to use those credentials. So, logout of GitHub.

Fill in the client ID and secret. You don’t put the URL like github.com/(your organization). Instead GitHub locates your repository by your client ID.

As you would see if you are logging into some application using Facebook or Google, GitHub asks you for permission to connect the two. If you get any error message here, check the callback URL you put above. GitHub needs to be able to reach that from the GitHub servers.

Click through this screen.

Changing configuration and handling debug errors

If you make a mistake, don’t click on “Add a source” again. Instead, go into settings in Enterprise Search, also located on the left-hand menu.

Then select the configure button shown below

Verifying your connection works

You should see some activity now:

Searching

Oddly enough, the search screen in Enterprise Search is hidden. It's not on the main landing page at http://(your server):3002. Instead, look on the left-hand side for Go to Search Application.

Their search syntax is natural language, but you do need to use certain keywords (see Help with the Search Syntax). It’s not well documented, yet.

When I type:

creator is walkerrowe

 

It shows these objects:

Then I typed the name of a repository I created, esearch. It presented this screen. Click on the item and it gives you the chance to look at it in GitHub.

You can refer to the Enterprise Search Searcher’s Manual for search syntax, but it gives very few examples. For example, it says that, as you type a search question, it highlights words that it finds in blue. That did not work for me using Chrome on Mac. It also seems to search files but not the content of files. In other words, it’s not indexing every word in your Google docs or Sheets.

Since the documentation is sparse, consider asking questions on the Enterprise Search community.

How to Install Elastic Enterprise Search (https://www.bmc.com/blogs/install-elastic-enterprise-search/)

Elastic.co has a product called Enterprise Search, formerly Swiftype, that’s aimed at businesses. Enterprise Search is like Google Search for internal company documents—an enterprise search tool for internal documents and files. It lets companies control who can access what documents. You can also use it to search public files on Google Drive, Github, Docker, etc., and write your own API to expose documents and files to internal users.

In this blog post, I’ll illustrate how to install Elastic Enterprise Search. In a subsequent post, I’ll talk about how to use it.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

Install Elasticsearch

First, you have to download and install Elasticsearch by following these steps. (Note: Enterprise Search will also install Filebeat. Its config file will be located at /usr/share/elasticsearch/enterprise-search-7.5.0/filebeat/filebeat.yml.)

Elasticsearch does not require a paid license, but Enterprise Search does. Luckily, you can use Enterprise Search for free for 30 days to evaluate it.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.0-amd64.deb

sudo dpkg -i elasticsearch-7.5.0-amd64.deb

Turn on security and bind Elasticsearch to a routable IP address, not localhost, so you could add other machines to the cluster:

sudo vim /etc/elasticsearch/elasticsearch.yml

add:

xpack.security.enabled: true
network.host: 172.31.46.15

Assuming you are logged in as user ubuntu (or change the name to your userid), change all folder permissions to ubuntu. This step should not be necessary, but since you can't run Elasticsearch as root, it patches up a step left out of their .deb file.

sudo chown -R ubuntu  /usr/share/elasticsearch
sudo chown -R ubuntu /var/log/elasticsearch/
sudo chown -R ubuntu /var/lib/elasticsearch/
sudo chown -R ubuntu /etc/elasticsearch
sudo chown ubuntu /etc/default/elasticsearch

Start Elasticsearch. If you cannot start it as a service, because it throws an error, you can start it this way. Note: you cannot run it as root.

cd  /usr/share/elasticsearch/bin
nohup ./elasticsearch&

Run this command to generate passwords for Elasticsearch; save these passwords somewhere.

./elasticsearch-setup-passwords auto

Initiating the setup of passwords for reserved users elastic,apm_system,kibana,logstash_system,beats_system,remote_monitoring_user.
The passwords will be randomly generated and printed to the console.
Please confirm that you would like to continue [y/N]y


Changed password for user apm_system
PASSWORD apm_system =XXXXXXXXX

Changed password for user kibana
PASSWORD kibana =XXXXXXXXX

Changed password for user logstash_system
PASSWORD logstash_system = XXXXXXXX

Changed password for user beats_system
PASSWORD beats_system = XXXXXXXXXX

Changed password for user remote_monitoring_user
PASSWORD remote_monitoring_user =XXXXXXXX

Changed password for user elastic
PASSWORD elastic = XXXXXXXXXXXX
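
At this point you can verify that Elasticsearch is running and that authentication works, using the elastic password generated above (the IP address is the network.host value set earlier):

curl -u elastic:(password) http://172.31.46.15:9200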

Install Enterprise Search

Now, we’ll install Elastic Enterprise Search. Open firewall port 3002 to the public IP address of your server. This is the web interface for Enterprise Search.

wget https://download.elastic.co/downloads/enterprisesearch/enterprise-search-7.5.0.tar.gz

cd /usr/share/elasticsearch

tar xvfz enterprise-search-7.5.0.tar.gz

Make these changes:

cd enterprise-search-7.5.0

vim config/enterprise-search.yml

ent_search.auth.source: standard
elasticsearch.username: elastic
elasticsearch.password: oe4emGR6Wnwp1wEwiRle
allow_es_settings_modification: true
ent_search.listen_host: 172.31.46.15
ent_search.external_url: http://walkercodetutorials.com:3002

Choose a password and start Enterprise Search as shown below. This command looks a little awkward but this is how you both set up an initial password and provide the password on subsequent starts.

ENT_SEARCH_DEFAULT_PASSWORD=password bin/enterprise-search

To run it in the background, e.g., after you have finished the setup, do:

env ENT_SEARCH_DEFAULT_PASSWORD=password nohup bin/enterprise-search&

Now login using:

userid: enterprise_search
password: password

 to http://(your server):3002

It’s important to look at stdout when you start the server to make sure it echoes this password. If you don’t see this message, erase the software and then delete the indexes that Enterprise Search created in Elasticsearch as shown in the Debugging section below.

filebeat.1   | #########################################################
filebeat.1   | 
filebeat.1   | *** Default user credentials have been setup. These are only printed once, so please ensure they are recorded. ***
filebeat.1   |       username: enterprise_search
filebeat.1   |       password: password
filebeat.1   | 
filebeat.1   | #########################################################

Here is the login screen:

Here is the landing page:

In the next post, I’ll show how to configure Enterprise Search to query Google Drive, Dropbox, and Github.

Debugging Enterprise Search

If anything goes wrong with the Enterprise Search installation, you must delete the indexes that it created in Elasticsearch before you repeat the installation.

You can list those indexes like this. Because you turned on security, you need to enter the userid and password. Use the Elasticsearch password auto generated above, not the Enterprise Search one you made up.

curl -X GET "http://(your server):9200/.ent-search*?pretty" -u elastic:(elasticsearch password, not the enterprise search password)

Then, delete all of them:

curl -X DELETE "http://(your server):9200/.ent-search*" -u elastic:(elasticsearch password, not the enterprise search password)

Now, reinstall Enterprise Search.

What is the ELK Stack? (https://www.bmc.com/blogs/elk-stack/)

A stack is any collection of software products that are designed to work together, such as the popular LAMP stack, comprised of Linux, Apache, MySQL, and PHP. The ELK stack includes ElasticSearch, LogStash, and Kibana.

ELK is one of the most widely used stacks for processing log files and storing them as JSON documents. It is extremely configurable, versatile, and scalable. It can be simple to use or complex, as it supports both simple and advanced operations.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

ELK is best understood by looking at the pieces in the following diagram.

Going left to right and top to bottom we have the following components:

  • Logs, sensor output, and application files. This list illustrates how ElasticSearch can be used to store all kinds of data. It can be used for cybersecurity or system monitoring. It can store application data of any kind. This is because it’s a JSON database, much like MongoDB, which is designed to store any kind of information, since JSON has no rules with regards to structure. ElasticSearch scales to enormous volumes, too, so you could use it to capture sensor data from industrial devices and all types of machinery and other IoT (Internet of Things) device output and keep adding clusters and nodes to scale the system.
  • Apache Kafka. This can be put in front of ElasticSearch to store streaming data, providing a buffering and queuing mechanism.
  • Filebeat, Packetbeat, Metricbeat. You can program ElasticSearch yourself, using regular expressions or writing parsers in any language. Or you can use any of the Beats packages, which are pre-programmed to work with most common file types. Beats can send the data directly to ElasticSearch or to Logstash for further processing, which then hands it off to ElasticSearch.
  • Grok. One of the many ElasticSearch plugins, grok makes parsing log files easier by providing a programming language that is simpler than using regular expressions, which are notoriously difficult. Geoip is another plugin, which maps IP addresses to longitude and latitude and looks up the city and country name by consulting its database.
  • Logstash. You can use Logstash to work with log files directly or you can process them with any of the Beats first. For web server logs, Filebeat has an nginx module and modules for Apache. But, it does not parse the message fields into individual fields; Logstash does that. So, you could use one or both. For custom logs, for which you would have to write your own parser, you should use Logstash and grok.
  • ElasticSearch. The database ElasticSearch stores documents in JSON format. They are stored with an index and document type as well as a document id. They can be simple JSON or nested JSON documents.
  • Machine learning. ElasticSearch has added some machine learning capabilities to its product. For example, an Anomaly Detection plugin flags events in a time series log, such as a network router log, to identify those that are statistically significant and should be investigated. In terms of cybersecurity, this could indicate a hacking event. For a financial system this could indicate fraud.
  • Kibana. This is a graphical frontend for ElasticSearch. You might prefer to work with ElasticSearch using curl and JSON queries, or you can use the graphical Kibana interface, which makes browsing and querying data easier. It also lets you create dashboards and charts targeted for different audiences. So, you could have one set of dashboards for the cybersecurity team, another for the performance monitoring team, and another for the ecommerce team.

Indexes and Document Types

Each ElasticSearch document is stored under an index and document type, which is given in the URL. For example, the document below has the index network and the type _doc. Each document requires a unique identifier, which is the _id field.

Below is a sample web server log JSON document.

"_index" : "network",
        "_type" : "_doc",
        "_id" : "dmx9emwB7Q7sfK_2g0Zo",
        "_score" : 1.0,
        "_source" : {
          "record_id" : "72552",
          "duration" : "0",
          "src_bytes" : "297",
          "host" : "paris",
          "message" : "72552,0,297,9317",
          "@version" : "1",
          "@timestamp" : "2019-08-10T07:45:41.642Z",
          "dest_bytes" : "9317",
          "path" : "/home/ubuntu/Documents/esearch/conn250K.csv"
        }

ElasticSearch Schemas

ElasticSearch is a noSQL database, which means it does not require a schema. So, ElasticSearch will take a JSON document and automatically set up an index mapping.

Automatic index mapping is convenient, but sometimes you need to nudge it in the right direction because you may not always want the automatically created index map. With especially complicated nested JSON documents, or documents containing an array of other JSON documents, ElasticSearch might flatten fields and arrays that you wanted kept as nested structures. That could lead to misleading or incorrect query results. So, you can set this mapping up yourself, which we explain in several articles within this guide (see navigation on the right side).
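
As a minimal sketch of doing that (the index and field names here are only illustrative), you create the index with an explicit mapping before loading any documents:

curl -X PUT --header 'Content-Type: application/json' http://localhost:9200/myindex -d '{
  "mappings": {
    "properties": {
      "src_bytes":  { "type": "integer" },
      "dest_bytes": { "type": "integer" },
      "city_name":  { "type": "keyword" }
    }
  }
}'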

Kibana

Kibana lets you query documents using an easy-to-understand Lucene query language. This is opposed to using the more complex, more powerful DSL syntax written in JSON, which typically uses curl. For example, just type the word water in Kibana and any document that contains the word water will be listed.

The image below shows one document in Kibana.

Querying ElasticSearch

You can query ElasticSearch using Kibana, by writing JSON queries, or by passing queries as command line arguments. The most common way to query ElasticSearch is to use curl. For example, here are some queries:

List all indexes:

curl -X GET 'http://localhost:9200/_cat/indices?v'

Query by passing parameters:

curl -X GET http://localhost:9200/samples/_search?q=school:Harvard

Query by writing the query as JSON:

curl -XGET --header 'Content-Type: application/json' http://localhost:9200/samples/_search -d '{
  "query" : {
    "match" : { "school": "Harvard" }
  }
}'

Writing Data to ElasticSearch

You write documents using curl as well. Here, we are writing a document with _id = 1 to index = samples with type = _doc.
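
Using the same curl approach shown in the commands cheat sheet earlier in this guide:

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/1 -d '{
   "school" : "Harvard"
}'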

LogStash

To get an understanding of how to work with Filebeat and Logstash, here is a sample Logstash config file. This one parses a CSV file and writes it to the ElasticSearch server at parisx:9200.

input {
  file {
    path => "/home/ubuntu/Documents/esearch/conn250K.csv"
    start_position => "beginning"
  }
}

filter {
      csv {
        columns => [ "record_id", "duration", "src_bytes", "dest_bytes" ]
     }
    }

output {
  elasticsearch { 
  hosts => ["parisx:9200"] 
  index => "network"
  }

 }

How to Load CSV File into ElasticSearch with Logstash (https://www.bmc.com/blogs/elasticsearch-load-csv-logstash/)

Here we show how to load CSV data into ElasticSearch using Logstash.

The file we use is network traffic. There are no heading fields, so we will add them.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

Download and Unzip the Data

Download this file eecs498.zip from Kaggle. Then unzip it. The resulting file is conn250K.csv. It has 256,670 records.

Next, change permissions on the file, since the extracted file has no permissions set.

chmod 777 conn250K.csv

Now, create this logstash file csv.conf, changing the path and server name to match your environment.

input {
  file {
    path => "/home/ubuntu/Documents/esearch/conn250K.csv"
    start_position => "beginning"
  }
}

filter {
      csv {
        columns => [ "record_id", "duration", "src_bytes", "dest_bytes" ]
     }
    }

output {
  elasticsearch { 
  hosts => ["parisx:9200"] 
  index => "network"
  }

 }

Then start logstash giving that config file name.

sudo bin/logstash -f config/csv.conf

While the load is running, you can list some documents:

curl -XGET http://parisx:9200/network/_search?pretty

results in:

"_index" : "network",
        "_type" : "_doc",
        "_id" : "dmx9emwB7Q7sfK_2g0Zo",
        "_score" : 1.0,
        "_source" : {
          "record_id" : "72552",
          "duration" : "0",
          "src_bytes" : "297",
          "host" : "paris",
          "message" : "72552,0,297,9317",
          "@version" : "1",
          "@timestamp" : "2019-08-10T07:45:41.642Z",
          "dest_bytes" : "9317",
          "path" : "/home/ubuntu/Documents/esearch/conn250K.csv"
        }

You can run this query to follow when the data load is complete, which is when the document count is 256,670.

curl -XGET http://parisx:9200/_cat/indices?v

Create Index Pattern in Kibana

Open Kibana.

Create the Index Pattern. Don’t use @timestamp as a key field as that only refers to the time we loaded the data into Logstash. Unfortunately, the data provided by Kaggle does not include any date, which is strange for network data. But we can use the record_id in later time series analysis.

Now go to the Discover tab and list some documents:

In the next blog post we will show how to use Elasticsearch Machine Learning to do Anomaly Detection on this network traffic.

ElasticSearch Machine Learning (https://www.bmc.com/blogs/elasticsearch-machine-learning/)

Here we discuss ElasticSearch Machine Learning. ML is an add-on to ElasticSearch that you can purchase with a standalone installation or pay as part of the monthly Elastic Cloud subscription.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

ElasticSearch Machine Learning

The term machine learning has a broad definition. It is a generic term handed over to the laymen as a way of avoiding discussing the specifics of the various models.

To be specific, what ElasticSearch ML does is unsupervised learning on time series data. That means it draws conclusions from a set of data instead of training a model (i.e., supervised learning) to make predictions, as you would with regression analysis using techniques such as neural networks, least squares, or support vector machines.

How is this Useful

To see the value of unsupervised learning time series analysis consider the typical approach to cybersecurity or application performance monitoring. That is to assume data follows a normal (gaussian) distribution and then select some threshold that one considers significant when flagging outliers.

This is the familiar bell curve approach. Even if one does not think of it that way, that's what they are doing. This curve is described by terms a high school math student should know: mean, variance, and standard deviation. But that is not machine learning.

Not Better than Guessing

Typically, someone doing application monitoring or cybersecurity flags an event when it lies at either end of the curve, where the probability of such an event is low. The cutoff is usually expressed as some multiple of σ (sigma), the standard deviation, chosen so that the probability of an event lying beyond it is very low.

You could also with derision call this approach guessing.

The threshold approach is flawed, because it leads to false positives. That causes analysts to spend time tracking down events that are not truly statistically significant.

For example, with cybersecurity, just because someone is sending data to a particular IP address more now than before does not mean that event is out of bounds. They need to consider the cyclical nature of events and look at current events in light of what has come before them. It could be this happens every month and is normal. A clever algorithm can do that by, for example, applying a least squares method and looking to minimize the error (i.e., the difference between what is observed and what is expected), against a shifting subset of data. This is how ElasticSearch does it. And they apply several algorithms, not just least squares.

How to use ElasticSearch to do Machine Learning

Here we show how to use the tool. In another blog post we will explain some of the logic and algorithms behind it. The tool is supposed to make it unnecessary to understand all of that; still, you should, so you can trust what it is telling you.

We are going to draw from this video by ElasticSearch and slow it down to single steps to make it simpler to understand.

For data we will use the New York City taxi cab dataset that you can download from Kaggle here. The data gives us pick up and drop off times and locations for NYC cabs over a period of a few years. We want to see when traffic falls off or increases in such a way that is an abnormality, like a taxi strike or snowstorm.

There are some limits to loading this data through the Kibana ML screen, which is a feature of ES ML. You can only load 100 MB of data at a time, and the taxi cab data is much larger than that. It includes a test and a train dataset. Since we are not training a model with unsupervised learning, we will just pick one of them. And since we need to stay under the 100 MB limit, we will split the 200 MB data file like this and just pick one 90 MB chunk.

unzip train.zip
split -b 90000000 train.csv

Also note that ElasticSearch tends to freeze up when you load data like this, unless you have a large cluster. But it still loads the data. So once it looks like it has finished loading, meaning the screen no longer updates, just click out of it and go to Index Management to see how many records are in the new index. It should be about 100,000 for each 90 MB of data.

Upload the Data into the Machine Learning Screens

Open Kibana and click on the Machine Learning icon. You will have that icon even if you don’t have a trial or paid license. But what will be missing is the Anomaly Detection, job creation, and other screens. So ElasticSearch will guide you through signing up for a trial.

From the Data Visualizer select Import a file and import one of the files you split from train.csv. Or if you loaded the data a different way you can use the Select an Index Pattern option.

ElasticSearch will show you the first 1,000 rows and then make some quick record counts.

Then click Import at the bottom of the screen. Give it a name for the Index Pattern name, like ny*.

Anomaly Detection

Now we get to the interesting part. We want ElasticSearch to look at this time series data. Pick the Single Metric option.

Then we have to create a job. ElasticSearch will run an aggregation query in the background. We tell it to sum passenger_count over time. (If the drop down box does not work just copy and paste the field name.)

We use this aggregation to observe when there is a drop off or spike in passengers that lies outside the normal range and that takes into consideration the normal rise and fall of passenger count over time.

Click the button Use Full Range of Data, so it will pick all the data available and not the date interval you manually put. You can change the bucket option to 10m.

Then the screen will fill out like this:

Then click Create Job. It starts running its algorithms and updating the display with vertical colored lines as it completes its logic.

View the Anomaly Detection Results

When it is done click View Results.

We will explain in another post exactly what calculations it has done, after all you should not trust an ML model without having some understanding of how it drew its conclusions.

But you can think of it like this. Taxi rides go up and down during the week versus the weekend and rush hour versus not. If you were working with sales data you would call that trends or seasonality. So if you used a gaussian (normal) distribution against that it would be wrong, as the mean and variance would be over the whole set including the high and low points. So the improved algorithm slides along making multiple normal curves (and other frequency distributions) against subsets of the data to eliminate this up and down pattern.

This produces the results below. The light blue area is the shifting probability distribution function. The red area is the anomaly, the point where the plot has gone outside the curve. It's also the point where we have run out of data. It has been flagged as an outlier by calculating an anomaly score, a point we will cover in the next post.

Using Kibana to Execute Queries in ElasticSearch using Lucene and Kibana Query Language (https://www.bmc.com/blogs/elasticsearch-lucene-kibana-query-language/)

We have discussed at length how to query ElasticSearch with CURL. Now we show how to do that with Kibana.

You can follow this blog post to populate your ES server with some data.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

Using JSON
JSON queries (aka the JSON DSL) are what we use with curl. But you can use them with Kibana too. Note that this approach will not work with aggregations, nested queries, and certain other query types.

When using JSON in Kibana, the difference is that you only pass in the query object. So for this curl query:

{"query":{"match":{"geoip.country_name":"Luxembourg"}}}

You would paste in only this portion in Kibana.

{"match":{"geoip.country_name":"Luxembourg"}}

Entering Queries in Kibana
In the Discover tab in Kibana, paste in the text above, first changing the query language from KQL to Lucene and making sure you select the logstash* index pattern. We discuss the Kibana Query Language (KQL) below.

If you forget to change the query language from KQL to Lucene it will give you the error:

Discover: input.charAt is not a function. (In 'input.charAt(peg$currPos)', 'input.charAt' is undefined)

The easiest way to enter the JSON DSL query is to use the query editor since it creates the query object for you:

Save the query, giving it some name:

Kibana Query Language (KQL) versus Lucene
You can use KQL or Lucene in Kibana. They are basically the same, except that KQL provides some simplification and supports scripting.

Here are some common queries and how you do them in each query language.

KQL: request:"/wordpress/"
Lucene: request:"/wordpress/"
The colon (:) means equals to. Quotes mean a collection of words, i.e., a phrase.

KQL: request:/wordpress/
Lucene: request:/wordpress/
You don't need quotes for a single word.

KQL: request:/wordpress/ and response:404
Lucene: request:/wordpress/ AND response:404
For KQL you have to explicitly write the boolean operator. For Lucene, a lowercase operator is treated as a string of text rather than an operator, so write it in capital letters.

KQL: wordpress
Lucene: wordpress
Matches any text (wordpress in this example) anywhere in the document, not in a specific field.

KQL: 200 or 404
Lucene: 200 404
Adding the word "or" in Lucene would also match text containing the string "or," so leave it off or use capital OR.

KQL: 200 and 404
Lucene: 200 AND 404
Use uppercase logical operators with Lucene.

KQL: geoip.country_name: "Luxembourg"
Lucene: {"match":{"geoip.country_name": "Luxembourg"}}
Lucene mode also accepts the JSON DSL, as we illustrated above.

KQL: response:>=200 and response:<=404
Lucene: response:[200 TO 404]
Range query.

KQL: kilobytes > 1
Lucene: not supported
Scripted field, where kilobytes is:
if (doc['bytes'].size()==0) { return 0; }
return doc['bytes'].value / 1024;

ElasticSearch Aggregations Explained (https://www.bmc.com/blogs/elasticsearch-aggregation/)

ElasticSearch lets you do the equivalent of the SQL GROUP BY with COUNT and AVERAGE functions. These are called aggregations.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

In other words, if you are looking at nginx web server logs you could:

  • group each web hit record by the city from where the user came
  • count them

So this would give you something like:

City web hits
Paris 20
London 30
Berlin 40

In SQL this would be something like:

select city, count(city) from logs
 group by city

Here we illustrate this using the simplest use case, web logs. Follow the previous doc to populate your ElasticSearch instance with some nginx web server logs if you want to follow along.

Aggregation
Because ElasticSearch is concerned with performance, there are some rules on what kinds of fields you can aggregate. You can group by any numeric field, but text fields have to be of type keyword or have fielddata=true.

You can think of keyword as being like an index. When we loaded the nginx data, we did not create the index mapping first; we let ElasticSearch build that on the fly. So it indexed all the text fields by mapping each of them with a keyword sub-field, like this:

"geoip" : {
          "dynamic" : "true",
          "properties" : {
            "city_name" : {
              "type" : "text",
              "norms" : false,
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }

Now, aggregations is a complicated topic. There are many variations. You can nest them to build up complex queries. Here we look at the simplest, most common use case: bucket aggregation. That is like the select, count SQL statement to produce a count by value.

In the example below, we want to count web hits by the city name.

The general pattern to build up the statement is:

  • Use aggs, which is short for aggregations.
  • Give the agg a name. Here we use cityName.
  • Tell it what field to use. Here we use the dot notation geoip.city_name since city_name is a property of geo_ip. In other words in a deeply nested JSON structure put a dot as you go down the hierarchy. For a JSON array, you would use ElasticSearch scripting, a topic we have not covered yet.
  • Add the word keyword, to tell it to use that index.
  • Give it the aggregation operation type. Here we use terms. You can also use avg (average) and some others.

You can also use filters, which we illustrate further below.

So this query produces a count of web hits by city.

curl -XGET --user $pwd --header 'Content-Type: application/json'  https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
        "aggs": {
                "cityName": {
                        "terms": {
                                "field": "geoip.city_name.keyword",
                                "size": 10
                        }
                }
        }
}'

Here are the results. We gave it the default size of 10, which caps how many buckets are returned. Since we have 18 cities in our data, "sum_other_doc_count" : 8 means it left off the 8 documents belonging to the remaining cities. Remember that ElasticSearch has many rules to keep performance high.

Notice that under each of these is a doc_count. So we had 6 web hits from the city of Montpellier.

"aggregations" : {
    "cityName" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8,
      "buckets" : [
        {
          "key" : "Montpellier",
          "doc_count" : 6
        },
        {
          "key" : "Suzano",
          "doc_count" : 3
        },
        {
          "key" : "Boardman",
          "doc_count" : 2
        },
        {
          "key" : "Mazatlán",
          "doc_count" : 2
        },
        {
          "key" : "New York",
          "doc_count" : 2
        },
        {
          "key" : "San Diego",
          "doc_count" : 2
        },
        {
          "key" : "Abadan",
          "doc_count" : 1
        },
        {
          "key" : "Ashburn",
          "doc_count" : 1
        },
        {
          "key" : "Bogotá",
          "doc_count" : 1
        },
        {
          "key" : "Cambe",
          "doc_count" : 1
        }
      ]
    }
  }
}


Here we count the same data by response code, e.g., 200, 404, etc. Since response is of type long, it's not necessary to add keyword to the end.

curl -XGET --user $pwd --header 'Content-Type: application/json'  https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
        "aggs": {
                "responses": {
                        "terms": {
                                "field": "response",
                                "size": 10
                        }
                }
        }
}'

Here are the results. Since there are only 4 types of responses in our data, fewer than the size of 10, it showed all of them.

"aggregations" : {
    "responses" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 200,
          "doc_count" : 23
        },
        {
          "key" : 400,
          "doc_count" : 13
        },
        {
          "key" : 404,
          "doc_count" : 8
        },
        {
          "key" : 304,
          "doc_count" : 5
        }
      ]
    }
  }


Adding a Query

Here is an example using a query to filter the results. This is from FDA drug interaction data that we will explore in our upcoming posts on analytics.

This query counts drug interactions by drug type, which is the whole purpose of the FDA database: to track drug interactions and side effects. So this is a bucket-type aggregation query with a subquery.

curl -XGET --user $pwd --header 'Content-Type: application/json'  https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/fda/_search?pretty -d '{
"query": {
    "term": {
      "patient.drug.medicinalproduct.keyword": {
        "value": "METRONIDAZOLE"
      }
    }
  },
      "aggs" : {
         "drug": {
             "terms" : {
                "field": "patient.drug.medicinalproduct.keyword"

           }
         },
         "adverseaffect" : {
             "terms" : {
               "field": "patient.reaction.reactionmeddrapt.keyword"
             }
       }
   }
}'


Using Beats and Logstash to Send Logs to ElasticSearch (https://www.bmc.com/blogs/elasticsearch-logs-beats-logstash/)

Here we explain how to send logs to ElasticSearch using Beats (aka File Beats) and Logstash. We will parse nginx web server logs, as it's one of the easiest use cases. We also use Elastic Cloud instead of our own local installation of ElasticSearch. But the instructions for a stand-alone installation are the same, except that in most cases you don't need to use a userid and password with a stand-alone installation.

We previously wrote about how to parse nginx logs using Beats by itself, without Logstash. You might wonder why you need both. The answer is that Beats will convert the logs to JSON, the format required by ElasticSearch, but it will not parse the GET or POST message field sent to the web server to pull out the URL, operation, location, etc. With Logstash you can do all of that.

So in this example:

  • Beats is configured to watch for new log entries written to /var/log/nginx/*.log.
  • Logstash is configured to listen to Beats, parse those logs, and then send them to ElasticSearch.

(This article is part of our ElasticSearch Guide. Use the right-hand menu to navigate.)

Download and install Beats:

wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.1.1-amd64.deb
sudo dpkg -i filebeat-7.1.1-amd64.deb

You don't need to enable the nginx Filebeat module, as we will let Logstash do the parsing.

Edit the /etc/filebeat/filebeat.yml config file:

You want to change the top and bottom sections of the file. Below we show them in two separate parts. First, the top:

The important items are:

 enabled: true

Otherwise it will do nothing.

 - /var/log/nginx/*.log

You can list which paths to watch here. Put each one on a line by itself. In this case we only list the nginx logs.

#=========================== Filebeat inputs =============================

filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

- type: log

  # Change to true to enable this input configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/nginx/*.log

Here you want to:

  • Comment out the ElasticSearch output; we will use Logstash to write there.
  • Uncomment the Logstash output lines.
  • Tell Beats where to find Logstash.
  • Make sure the output.elasticsearch: line itself is commented out (it appears below as ##output.elasticsearch:).

#-------------------------- Elasticsearch output ------------------------------
##output.elasticsearch:
  # Array of hosts to connect to.
  # hosts: ["localhost:9200"]

  # Enabled ilm (beta) to use index lifecycle management instead daily indices.
  #ilm.enabled: false

  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]

Now start Beats. The -e tells it to write logs to stdout, so you can see it working and check for errors.

sudo /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml
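
Once the foreground run looks healthy, you could instead run Filebeat as a service. This assumes the .deb package registered the systemd unit, which it normally does; its logs then go to the journal rather than your terminal:

sudo systemctl enable filebeat
sudo systemctl start filebeat
sudo journalctl -u filebeat -f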

Install LogStash

cd /usr/share
sudo mkdir logstash
cd logstash
sudo wget https://artifacts.elastic.co/downloads/logstash/logstash-7.1.1.tar.gz
sudo tar xvfz logstash-7.1.1.tar.gz

Now create /usr/share/logstash/logstash-7.1.1/config/nginx.conf (the complete file is shown further below).

The items to note are:

The input section tells Logstash to listen to Beats on port 5044.

The filter section uses grok. To understand it fully you would have to understand grok, but don't try that yet. It's a log-parsing tool: it matches lines against named patterns for common log formats, and it can be extended. Use the example below; many of the grok examples in documentation and blog posts don't work as written, so tech writers tend to reuse the same known-working pattern. The output section, shown next, sends the parsed events to ElasticSearch:
output {
  elasticsearch {
    hosts => ["https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243"]
    user => "elastic"
    password => "xxxxxxxx"
    index => "logstash-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}

This part is disappointing, as Logstash does not let you use cloud.id and cloud.auth to connect to ElasticSearch the way Beats does. So you have to give it the URL, the userid, and the password. Use the same userid and password that you log into cloud.elastic.com with.

You could also create another user, but then you would have to give that user the authority to create indices. So using the elastic user is simply taking the superuser shortcut.

The index line builds the index name from the word logstash plus the date. The goal is to give it some meaningful name; something based on nginx might be better, since you can use Logstash to work with all kinds of logs and applications.

codec => rubydebug writes the output to stdout so that you can see that it is working.

input {
  # Listen for events sent by Filebeat on port 5044.
  beats {
    port => 5044
    host => "0.0.0.0"
  }
}

filter {
  # Parse the raw nginx access-log line into named fields.
  grok {
    match => [ "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}" ]
    overwrite => [ "message" ]
  }
  # Cast numeric fields so they are indexed as numbers, not strings.
  mutate {
    convert => ["response", "integer"]
    convert => ["bytes", "integer"]
    convert => ["responsetime", "float"]
  }
  # Look up the client IP address to add geographic fields.
  geoip {
    source => "clientip"
    target => "geoip"
    add_tag => [ "nginx-geoip" ]
  }
  # Use the timestamp from the log line as the event time, then drop the raw field.
  date {
    match => [ "timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
    remove_field => [ "timestamp" ]
  }
  # Split the user-agent string into browser, OS, and device fields.
  useragent {
    source => "agent"
  }
}

output {
  elasticsearch {
    hosts => ["https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243"]
    user => "elastic"
    password => "xxxxxxxx"
    index => "logstash-%{+YYYY.MM.dd}"
  }
  # Also echo each event to stdout so you can see that it is working.
  stdout { codec => rubydebug }
}
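
To make the grok step a little more concrete: for an access-log line like the one below (taken from the sample document shown later in this post), the COMBINEDAPACHELOG pattern produces named fields such as clientip, auth, timestamp, verb, request, httpversion, response, bytes, referrer, and agent.

46.160.190.178 - - [19/Jun/2019:12:39:52 +0000] "GET / HTTP/1.1" 200 481 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"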

Now start Logstash in the foreground so that you can see what is going on.

sudo /usr/share/logstash/logstash-7.1.1/bin/logstash -f /usr/share/logstash/logstash-7.1.1/config/nginx.conf
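
If you want to check the file for syntax errors first, Logstash can parse the config and exit without processing any events, using the --config.test_and_exit flag:

sudo /usr/share/logstash/logstash-7.1.1/bin/logstash -f /usr/share/logstash/logstash-7.1.1/config/nginx.conf --config.test_and_exit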

Assuming you have the nginx web server running and some logs being written to /var/log/nginx, after a minute or so Logstash should start writing documents to ElasticSearch. Or you can download https://raw.githubusercontent.com/respondcreate/nginx-access-log-frequency/master/example-access.log to give it some sample entries.
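
For example, assuming Filebeat can read files in /var/log/nginx, you could drop the sample file in place like this (the file name is arbitrary; it just has to match the *.log glob configured earlier):

sudo wget -O /var/log/nginx/example-access.log https://raw.githubusercontent.com/respondcreate/nginx-access-log-frequency/master/example-access.log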

Export your ElasticSearch userid and password into an environment variable:

export pwd="elastic:xxxxx"

Then query ElasticSearch and you should see that the logstash* index has been created.

curl --user $pwd -XGET https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash-2019.06.19-000001/_search?pretty
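
You can also list all indices as a quick check that the new one exists, using the same credentials:

curl --user $pwd -XGET 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/_cat/indices?v'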

Now you can query that ElasticSearch index and look at one record. Below we have shortened the record so that you can see that it has parsed the message log entry into individual fields, which you could then query, like request (the URL) and verb (GET, PUT, etc.).

curl --user $pwd -X GET 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash-2019.06.19-000001/_search?pretty'
{
        "_index" : "logstash-2019.06.19-000001",
        "_type" : "_doc",
        "_id" : "9RZDcWsBDzKZEQI4C4Qu",
        
…

          "verb" : "GET",
          "input" : {
            "type" : "log"
          },
          "auth" : "-",
          "source" : "/var/log/nginx/another.log",
          "device" : "Other",
        

….

          "geoip" : {
            "timezone" : "Europe/Moscow",
            "latitude" : 55.7386,
            "location" : {
              "lon" : 37.6068,
              "lat" : 55.7386
            },
            "longitude" : 37.6068,
            "country_code2" : "RU",
            "ip" : "46.160.190.178",
            "continent_code" : "EU",
            "country_code3" : "RU",
            "country_name" : "Russia"
          },
          


…


          "response" : 200,
          "os_name" : "Windows",
          "major" : "52",
          "agent" : "\"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\"",
          "build" : "",
          "clientip" : "46.160.190.178",
          "@version" : "1",
          "message" : "46.160.190.178 - - [19/Jun/2019:12:39:52 +0000] \"GET / HTTP/1.1\" 200 481 \"-\" \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\"",
          

…

          "request" : "/",

In Kibana the document appears as a structured record; expanding it shows each of the parsed fields individually.

]]>