Elasticsearch is an open-source search engine that is quite popular for centralized log ingestion. It makes logs from many different sources available and searchable in one central location.

Ingesting data from various sources, though, creates a problem: how do you normalize the incoming data, split it into common fields, add or remove metadata, and so on? Logstash is the go-to solution for all of this (and more). It can manipulate and transform incoming data before pushing it off to Elasticsearch for indexing.

Kibana is the tool of choice to search and visualize the data indexed in Elasticsearch. Together, these three are often referred to as the ELK stack.

The Problem

AWS now has a managed Elasticsearch offering, which is a fantastic option for small teams without the capacity to run a self-hosted Elasticsearch cluster on EC2/ECS/EKS. There is a downside, though: it only offers E_K from the ELK stack. Yes, there is no L (Logstash) out of the box. The only option is to self-host a Logstash install - which kind of defeats the purpose of using a managed service.

The Solution

Enter Elasticsearch ingest pipelines. They are not a complete replacement for Logstash, but if data transformation is the only thing you use Logstash for, they handle it quite easily. Best of all, they run on the Elasticsearch cluster itself, so there is nothing extra to host.

In my project, logs are aggregated from a number of different sources - Java app logs, httpd logs, systemd logs, external system logs and so on. While most of these are shipped by filebeat agents, external logs arriving in S3 are ingested via a Lambda function.

Using ingest pipelines, I can apply a different grok pattern to each source to split the data into fields, while still indexing everything into the same index. At a high level, this is what the whole setup looks like.

[Diagram: filebeat agents and the S3 Lambda forward logs to the main ingest pipeline, which routes them to per-source sub-pipelines before indexing.]

Ingest Pipeline(s)

At its core, an ingest pipeline is a series of processors that are executed in order to process and transform data. In this setup there are multiple ingest pipelines: the main pipeline accepts all incoming data and, based on a condition, invokes the appropriate sub-pipeline.

That condition is the value of the field logpattern. This will become clearer as we look at the configuration of each component in the setup.

Let’s start by creating the pipelines. Then we’ll look at the entry points - filebeat and Lambda - which read the logs and forward them to the Elasticsearch cluster.

Main Pipeline Configuration

Broadly, all pipelines have the same structure - a description and a list of processors. In the case of main_pipeline, I have used the following -

  • drop - Prevents the document from being indexed when a condition is met.
  • set - Adds a new field; the value can be static or derived from other fields.
  • remove - Deletes fields from the document before it is indexed.
  • pipeline - Triggers another ingest pipeline for further processing.
{
    "description" : "main pipeline",
    "processors" : [
      {
        "drop" : {
          "if" : "ctx.message.toLowerCase().contains('some unwanted data')"
        }
      },
      {
        "set" : {
          "field" : "hostname",
          "value" : "{{host.name}}"
        }
      },
      {
        "remove" : {
          "field" : [
            "host.containerized",
            "host.architecture",
            "host.hostname",
            "host.name",
            "host.os.codename",
          ]
        }
      },
      {
        "pipeline" : {
          "if" : "ctx.logpattern == 'java_log'",
          "name" : "java_log_pipeline"
        }
      },
      {
        "pipeline" : {
          "if" : "ctx.logpattern == 'httpd_log'",
          "name" : "httpd_log_pipeline"
        }
      },
      {
        "pipeline" : {
          "if" : "ctx.logpattern == 'system_log'",
          "name" : "system_log_pipeline"
        }
      },
      {
        "pipeline" : {
          "if" : "ctx.logpattern == 'external_log'",
          "name" : "external_log_pipeline"
        }
      }
    ],
    "on_failure" : [
      {
        "set" : {
          "field" : "Error",
          "value" : "{{_ingest.on_failure_message}}"
        }
      }
    ]
  }

The other pipelines have the same structure but different processors, the most important being the grok processor. It is responsible for splitting the log entry into sub-fields which can then be searched or aggregated. The following is an example of the httpd_log_pipeline.

{
    "description" : "httpd logs",
    "processors" : [
      {
        "gsub" : {
          "field" : "message",
          "pattern" : "\"",
          "replacement" : ""
        }
      },
      {
        "grok" : {
          "field" : "message",
          "patterns" : [
            "^%{IPV4:ipv4} - %{USER:username} %{HTTPDATE:datetime} %{PROG:method} %{URIPATHPARAM:request_uri} %{EMAILLOCALPART:http_version} %{NUMBER:http_status_code} %{PROG:pid} (%{NOTSPACE:request_by}|-) %{JAVALOGMESSAGE:useragent}$"
          ]
        }
      },
      {
        "set" : {
          "field" : "details",
          "value" : "{{ipv4}} - {{username}} {{method}} {{request_uri}} {{http_version}} {{http_status_code}} {{pid}} {{request_by}} {{useragent}}"
        }
      },
      {
        "date" : {
          "field" : "datetime",
          "formats" : [
            "dd/MMM/yyyy:HH:mm:ss Z",
            "ISO8601"
          ],
          "timezone" : "Europe/London"
        }
      },
      {
        "uppercase" : {
          "field" : "severity",
          "on_failure" : [
            {
              "set" : {
                "field" : "severity",
                "value" : "INFO"
              }
            }
          ]
        }
      },
      {
        "remove" : {
          "field" : "datetime"
        }
      }
    ]
}

Elasticsearch Configuration

Once the ingest pipeline JSONs are ready, it’s quite simple to add them to the Elasticsearch cluster. The following command creates (or updates) a pipeline -

curl -X PUT "https://elasticsearch.example.com/_ingest/pipeline/pipeline_name" -H 'Content-Type: application/json' -d @pipeline_definition.json

The following command displays the currently configured pipelines on the Elasticsearch cluster -

curl https://elasticsearch.example.com/_ingest/pipeline
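
Before pointing any log shippers at the cluster, a pipeline can also be exercised with the _simulate API. The document below is made up purely for illustration - it just needs a message, a logpattern value and a host.name so the drop, set, remove and pipeline processors have something to act on; a real log line from one of the sources would exercise the grok patterns in the sub-pipelines as well.

curl -X POST "https://elasticsearch.example.com/_ingest/pipeline/main_pipeline/_simulate" \
  -H 'Content-Type: application/json' \
  -d '{ "docs": [ { "_source": { "message": "2020-01-01 12:00:01 INFO Application started", "logpattern": "java_log", "host": { "name": "app-server-01" } } } ] }'

The response contains the transformed documents (or an Error field if a processor failed), which makes it easy to confirm the drop condition, the hostname field and the routing to the sub-pipelines behave as expected.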

Filebeat Configuration

The log sources are modified next to use the configured ingest pipelines.

Only a subset of the properties that might be required is shown here. Notice the field logpattern under fields - this ensures the new field is added to every log entry. Further down, in the Elasticsearch output section, the index name and the pipeline name are specified.

- type: log
  enabled: true
  paths:
    - /opt/app/server/logs/stdout.log
  fields_under_root: true
  fields:
    logpattern: "java_logs"
    role: "java"
  multiline.pattern: ^\d{4}-\d{1,2}-\d{1,2}
  multiline.negate: true
  multiline.match: after

# Elasticsearch
output.elasticsearch:
  hosts: [ "https://elasticsearch.example.com:443" ]
  index: "logs-%{+yyyy.MM.dd}"
  pipeline: main_pipeline
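
Assuming a standard filebeat installation, the built-in test subcommands are a quick way to sanity-check the YAML and the connection to the cluster before restarting the agent (the config path may differ depending on how filebeat was installed):

filebeat test config -c /etc/filebeat/filebeat.yml
filebeat test output -c /etc/filebeat/filebeat.yml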

Lambda Configuration

The second input source is Lambda, which uses the Elasticsearch bulk API to index the logs pushed to S3. As with filebeat, a logpattern field is added to each record, and the pipeline name is passed as a query string parameter on the bulk request -

POST /index/_bulk?pipeline=main_pipeline
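
The Lambda code itself is outside the scope of this post, but a minimal sketch of the idea looks something like the one below. The index name and domain URL are placeholders, and it assumes the domain is reachable without request signing - a real deployment against AWS managed Elasticsearch would typically need SigV4-signed requests or an appropriate access policy.

import json
import urllib.request

import boto3

s3 = boto3.client("s3")

ES_URL = "https://elasticsearch.example.com"  # placeholder domain URL
INDEX = "logs-external"                       # placeholder index name


def handler(event, context):
    """Triggered by S3 PUT events; forwards each log line to the bulk API."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Build an NDJSON bulk payload: one action line plus one document line
        # per log entry, tagging every document with the logpattern field.
        lines = []
        for entry in body.splitlines():
            if not entry.strip():
                continue
            lines.append(json.dumps({"index": {}}))
            lines.append(json.dumps({"message": entry, "logpattern": "external_log"}))
        if not lines:
            continue
        payload = "\n".join(lines) + "\n"

        # The pipeline query string sends every document through main_pipeline.
        req = urllib.request.Request(
            url=f"{ES_URL}/{INDEX}/_bulk?pipeline=main_pipeline",
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read()[:200])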

Conclusion

That’s it!

Now all incoming data is transformed by a combination of ingest pipelines before being indexed in Elasticsearch. 👏
