Monitoring: the Holy Grail

•December 23, 2013


What is the ultimate goal of monitoring? Over my many years in the IT industry I have seen it evolve, from simply knowing when something doesn’t ping to a full synthetic traversal of an application and all of its components. At its core, though, the goal has never changed: use technology to automate the work and reduce man-hours, shrinking the human factor in the equation until, ultimately, it is removed from monitoring altogether.

This is the Holy Grail of monitoring: to completely remove the need for a human to watch a screen waiting for an issue to occur. To put it simply, we only want to know when there is an issue that needs action from a human to resolve it. For example, if a server reboots during a maintenance window, we do not need to wake everyone up to tell them that a critical server or application is having issues. The application owner, however, does want to know when his services didn’t start correctly after that reboot. So we build additional layers of intelligence into our software. The logic was simple at first (maintenance windows, thresholds, and so on), and it worked really well to reduce noise and free up a person’s time. We then began to engage in higher logic, such as “If X happens, then run script Z”. We also built hierarchical structures into the software that allow issues upstream to be inherited by everything downstream, giving a better picture of what is affected by an outage. These types of enhancements, among others, have brought us much closer to the Holy Grail.
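To make the maintenance-window and upstream/downstream ideas concrete, here is a minimal Python sketch. It is not taken from any particular monitoring product; the host names, the `MAINTENANCE` and `UPSTREAM` tables, and the `should_page` function are all hypothetical and exist only for illustration.

```python
from datetime import datetime

# Hypothetical maintenance windows: host -> (start, end)
MAINTENANCE = {
    "app01": (datetime(2013, 12, 23, 1, 0), datetime(2013, 12, 23, 3, 0)),
}

# Hypothetical dependency map: host -> the upstream device it relies on
UPSTREAM = {"app01": "switch01", "db01": "switch01"}


def in_maintenance(host, when):
    """True if `when` falls inside the host's maintenance window."""
    window = MAINTENANCE.get(host)
    return window is not None and window[0] <= when <= window[1]


def should_page(host, when, down_hosts):
    """Return True only when a human needs to act on this host's alert."""
    if in_maintenance(host, when):
        return False  # expected outage, stay quiet
    # Walk up the dependency chain: if anything upstream is already down,
    # this alert is a symptom and the upstream alert is the one to page on.
    parent = UPSTREAM.get(host)
    while parent is not None:
        if parent in down_hosts:
            return False
        parent = UPSTREAM.get(parent)
    return True
```

With these sample tables, an alert for `app01` at 2:00 AM on December 23 is suppressed by its window, and an alert for `db01` is suppressed whenever `switch01` is in the set of down hosts, while `switch01` itself still pages.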

The latest advancements have brought in even more automation by joining different systems together, removing still more human interaction from monitoring. For example, the monitoring system detects an issue, which kicks off a chain reaction of automations across multiple Operations systems. A ticket is opened with the Service Desk, automatically filled out, and assigned to the appropriate individuals. Then a set of actions is performed in an attempt to resolve the issue. The success or failure of those actions is written to the ticket without human intervention. This kind of automation significantly reduces the need for human interaction and makes things move more smoothly.
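The chain described above (detect, open a ticket, attempt a fix, record the outcome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ticket` class and `handle_alert` function are hypothetical stand-ins, and a real Service Desk integration would call that product's own API instead of an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical in-memory stand-in for a Service Desk ticket."""
    summary: str
    assignee: str
    notes: list = field(default_factory=list)
    status: str = "open"


def handle_alert(summary, assignee, actions):
    """Open a ticket, try each remediation in order, and log every result.

    `actions` is a list of (name, callable) pairs; each callable returns
    a truthy value if it fixed the issue.
    """
    ticket = Ticket(summary=summary, assignee=assignee)
    for name, action in actions:
        try:
            ok = bool(action())
        except Exception as exc:
            ok = False
            ticket.notes.append(f"{name}: error ({exc})")
        else:
            ticket.notes.append(f"{name}: {'succeeded' if ok else 'failed'}")
        if ok:
            ticket.status = "resolved"
            break
    else:
        # No action fixed it (or none were defined): escalate to a person.
        ticket.status = "needs human"
    return ticket
```

A human only gets involved when the ticket ends up in the "needs human" state, and by then the notes already show what was tried.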

For those of us who manage such systems, the challenge is often setting up the logic to address all the possible scenarios. This can be difficult if you do not have the full cooperation of key personnel. I will get straight to the point: the key to achieving the Holy Grail is cooperation with the individuals and teams that would benefit most from the amazing systems that have been developed. Teamwork, if you will, is the fundamental part of achieving any goal within a group of individuals. I have had great successes and monumental failures implementing monitoring solutions, and in both cases teamwork (or the lack of it) was what decided the outcome. Everyone has to be willing to put in the effort to make the system work; if any one part does not, it will fail. This is why, when I set up a new system, start a new project, or even perform a minor upgrade, most of my effort goes into ensuring that I have the full support of all parties, to give it the best chance of success.

More often than I would like, friction occurs, and sometimes you cannot make the solution work. People have their own preferences and ideals, and no matter how much customization you do, you cannot turn an apple into an orange. In those cases, I have found it best to provide the greatest solution you can (the best apple in the bushel) and present it to them in the manner that makes it most appetizing. You can’t make them eat it, but you can make it as irresistible as possible. This way, if they still reject it, you know that you did your best, and others will too. This will insulate you a bit from blame for the failure.

As a child I would sit in my great-grandmother’s garden and watch the hummingbird feeders. I kept seeing all kinds of insects around the feeders in addition to the birds. I asked my grandmother why “everything” loved the hummingbird feeders more than the flowers. Her response has stuck with me my entire life: “You will always catch more with something sweet than with something sour.” This is, of course, a variation on the old saying “You will catch more flies with honey than with vinegar”, and it has rung true my entire life. I have found that in difficult situations where it’s a matter of preference, you will always get more acceptance by going the extra step than by relying on a rule, law, or edict to accomplish the goal.

The Holy Grail of Monitoring will be amazing when it is accomplished. However, it will take a great deal of effort from all parts of an organization to make it a reality. Building a system that resolves even 75% of issues without human interaction would be a great achievement, and it would be only the beginning of what an organization could accomplish by working that well together. Because no matter how much you would like to, you cannot remove the Human Factor from the equation. The human factor is what gives purpose and HEART to any project. We can never forget that the purpose of any system is to make things better for those who use it.

Happy Holidays!

ICE – Top 3 “Must Haves” in a Monitoring Software Solution

•December 3, 2013

I have worked with many different monitoring solutions over the course of my career, and have decided to put together a Top 3 “Must Haves” for any monitoring software. These are the top three questions I would ask when evaluating the software. Why three, you may ask? Three points are easy to remember while you are evaluating a product, and they fit neatly into a TLA (Three-Letter Acronym), which makes them easier still to recall. For my Top 3 I came up with I.C.E. Below are my Top 3 “Must Haves”.

  1. I is for Intelligence: Is the software intelligent; that is, does it help automate your monitoring? Example: Does the solution have a flood gate that closes when a large number of alerts occur at once?
  2. C is for Customizable: Can you customize the solution with relative ease? Example: Can you change the wording of alerts to reflect your organization’s terminology? Does it let you include “Tribal Knowledge” in the system? Can you customize the Dashboard views to fit your specific needs? Does it allow you to customize reports easily?
  3. E is for Extendable: Is the solution an extension of your core operations, or is it clunky and cumbersome, requiring more resources than it is worth? Does the software extend to your other core operations solutions? Example: It integrates seamlessly with your ticketing system to help automate Service Delivery, so fewer hands have to get involved to resolve an issue.
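As an illustration of the “flood gate” mentioned under Intelligence, here is a minimal Python sketch of one way such a gate could work: a sliding time window that stops passing alerts once too many arrive. The `FloodGate` class, its parameters, and the numbers in the usage note are hypothetical and do not describe any vendor’s implementation.

```python
from collections import deque


class FloodGate:
    """Pass alerts through until more than `max_alerts` arrive within
    `window_seconds`; after that, hold them back until the storm eases."""

    def __init__(self, max_alerts, window_seconds):
        self.max_alerts = max_alerts
        self.window_seconds = window_seconds
        self._recent = deque()  # timestamps of alerts inside the window

    def allow(self, timestamp):
        """Record an alert at `timestamp` (in seconds) and return True
        if it should still be sent on to people."""
        # Forget alerts that have aged out of the window.
        while self._recent and timestamp - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        # Suppressed alerts are counted too, so the gate stays closed
        # for as long as the flood continues.
        self._recent.append(timestamp)
        return len(self._recent) <= self.max_alerts
```

For example, with `FloodGate(3, 60)`, the first three alerts in a minute go out, the rest of the burst is held back, and alerts flow again once the window has cleared. A real product would presumably also send a single summary notification when the gate closes.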

Over the course of my career I would say that there have been few solutions that hit 100% on all three criteria. One thing I have noticed is that the institutions using the solutions that do have been top performers in their industries. They also have the smoothest-running operations I have worked with.

Best of Luck and Happy Holidays!

PowerShell Script: Restart OpsMgr 2012 Services

•September 13, 2013 • 3 Comments

So I have been having a pesky time tracking down exactly why SCOM 2012 email notifications get hung up after normal OS patching and a reboot. It doesn’t happen every time the server reboots, and it doesn’t occur on service restarts either. There is nothing in the event logs to show why notifications wouldn’t be working properly. Since I cannot reproduce the issue on demand, I had to create a band-aid to keep the notifications going out as expected.

So, below is my short-term solution. I added this script to a Scheduled Task that runs at start-up. This way the server will restart the services shortly after a reboot occurs, such as with WSUS patching. This will at least keep the notifications flowing until I can find the culprit.

###################################################
## OpsMgr_ServiceRestart.ps1
## Version 1.1
## PowerShell script to restart the OpsMgr 2012 services. This can be used in
## conjunction with a Scheduled Task to restart the services after a reboot
##
##
## Author: Josh Ancel
## Date Written: 9/13/2013
##
##
###################################################
# Stop the System Center Data Access (OMSDK) and System Center Management APM services first
##Write-Host "Stopping Services"
Stop-Service -Name omsdk, 'System Center Management APM'
Start-Sleep -Seconds 45
##Write-Host "Waiting..."
# Then stop System Center Management (HealthService)
Stop-Service -Name HealthService
##Write-Host "Waiting 3.5 min before starting services...."
Start-Sleep -Seconds 210
# Start the services again in reverse order
##Write-Host "Starting Services"
Start-Service -Name HealthService
##Write-Host "Waiting 3.5 min......again....."
Start-Sleep -Seconds 210
Start-Service -Name omsdk, 'System Center Management APM'
##Write-Host "Completed"

#EOF

I thought others might find the script handy so I shared. As with all my scripts feel free to use it, abuse it, or even fuse it.

Good Luck!

GUID Translate Script – PowerShell

•September 10, 2013 • 2 Comments

I wrote a quick little PowerShell script to translate the GUID that OpsMgr 2012 spits out and give you the display name.

Note: This script is designed to be run from the Operations Manager 2012 Shell / PowerShell

###################################################
## Version 1.1
## This takes any GUID that SCOM spits out and translates it to its display name.
## You can paste your GUID directly into the OpsMgr Shell.
## This is designed to be run from the Operations Manager Shell PowerShell CLI.
##
## Author: Josh Ancel
## Date Written: 9/10/2013
#
#
###################################################

$gUIDtrans = Read-Host "Enter the GUID you would like translated: "

Get-SCOMClassInstance -Id $gUIDtrans | Format-Table DisplayName

#EOF

Just a little time saver; I hope it helps.

-Good Luck!

Filter Event Log on Message string using PowerShell

•August 27, 2013

I recently needed to look for a very specific string in the Message portion of the Security event log. (I actually had 4 different strings to search on.) Since you cannot filter events on only a portion of the Message section, I came up with the PowerShell script below. I have changed it to be more generic so that anyone can easily use it.

###################################################
## Version 1.2
## PowerShell Script that collects eventlogs of specific ID. It then filters on date range
## and specified string within the Message portion of the event.
## It then outputs to HTML file, if you would rather a TXT file, you can uncomment the
## listed command below and comment out the ConvertTo-HTML.
##
##
## Author: Josh Ancel
## Date Written: 8/27/2013
##
##
###################################################

## Ask where the user would like to store the output.
Write-Host "Enter the path and file name where you would like to store the information."
$fileLocation = Read-Host "Note! File is HTML format. Example: C:\Temp\events.html "

## Ask which event log to search.
$EventLog = Read-Host "Enter the Event Log that you would like to search. Example: Application or Security"

## Ask for the EventID to filter on.
$EventID = Read-Host "Enter the EventID you would like to filter on. Example: 22222 "

## Collect the date range from the user.
$StartDate = Read-Host "Enter Start Date in format MM/DD/YYYY"
$EndDate = Read-Host "Enter End Date in format MM/DD/YYYY"

## Ask for the Message filter. Wildcards can be used at the start and end of the string.
## If you need to combine several patterns, edit the script to include additional variables.
Write-Host "Please provide the string that you would like to filter on."
Write-Host "Wildcards are only accepted at the start and end of the string. If you need multiples you will need to edit the PowerShell script."
$MessageFilter = Read-Host " Example: *Failed* "

## Collect events from the chosen log, then filter on date range, EventID, and Message string.
## (Version 1.2 as posted always read the Security log and ignored $EventLog.)
$SmartAudit = Get-EventLog -LogName $EventLog -After $StartDate -Before $EndDate | Where-Object {$EventID -contains $_.EventID -and $_.Message -like $MessageFilter}

## Write the results to an HTML file.

$SmartAudit | ConvertTo-Html | Set-Content $fileLocation
## @() makes the count work for zero or one match as well.
Write-Host "Total number of matching events found = " (@($SmartAudit).Count)
Write-Host "Completed. Your file is ready for you to view in " $fileLocation

## If you want to write to clear text, use this command instead of ConvertTo-Html
#$SmartAudit | Format-List | Out-File $fileLocation

#EOF

Good luck!

Linux – Network Tap Server

•August 12, 2013


I just finished up my most recent project: implementing a monitoring server on a network “tap” or SPAN port. The need arose to capture and filter data on segments of the network in real time. We needed a way to look at a specific VLAN and see what kind of traffic was going across it. This could have been accomplished with SolarWinds NetFlow, but it would have required a significant number of adjustments to our current configuration to view all ports. Essentially, SolarWinds does a great job of telling you what you ask it about traffic. The problem is that even though we have 5000+ ports defined in SolarWinds, all traffic outside of those is lumped together as “Unmonitored Traffic”. If I had known which port or ports we were targeting, it would have been easy to set up a watch for traffic on them. However, we wanted to find out what we didn’t know; that is, what kind of traffic was crossing the network without our knowledge. (We don’t know what we don’t know.) So I developed a Linux-based network monitoring system that could collect, store, and report what we needed.

This was achieved using several open-source tools on an Ubuntu 12.04 Linux Server/Desktop running on a 6-year-old HP DL380 G2. First and foremost was EtherApe, which provides a real-time graphical representation of network traffic. This can be an extremely powerful tool when attempting to catch things as they occur. Second was tcpdump, which allows for long-term collection of packet data to be analyzed after the event occurs. Also worth noting, along the same lines as tcpdump, is Wireshark; however, I have too often seen Wireshark run for 48 hours and then error out, whereas I have seen tcpdump run for a week without issue. In the same category is IPTraf, which can provide a lot of good traffic detail through a CLI/SSH session. Lastly, the Security guys asked if I would include PBnJ to help with the effort of locking down systems within the network. PBnJ uses Nmap, MySQL, and some other tools to collect, store, and compare the open ports and vulnerabilities of targeted systems and IP ranges.

I was pleasantly surprised at how quickly I was able to put these tools together and implement them in my Proof of Concept Lab. A few tweaks were needed to the out-of-the-box install of each application, but for the most part I was able to “apt-get install” the tools and start using them right off the bat.

Once in production, we found a good amount of unexpected traffic. I would highly recommend that any network infrastructure team put this type of server in place on their network. It gives you a window into the network traffic without having to load Wireshark on each node that you are troubleshooting. As I continue to configure and tweak the settings on these apps, I will post them, in the hope of saving others time and trouble.

Good Luck!

SQL Query – OperationManager Database Statistics

•July 1, 2013 • 5 Comments


OperationManager Database Statistics

A good friend of mine provided me with this SQL query, which gives you a bunch of statistics about your Operations Manager database. This is such great information that I asked him if I could share it with the world, and he said it was OK. Below is the query. One quick note: you will want to choose “Results to Text” from the query tool bar; otherwise the output comes back as separate tables and can be difficult to read.

SQL Query :

-- Some basic statistics queries
select 'MP element numbers'
select 'MP', count(*) from ManagementPack
select 'Discovery', count(*) from Discovery
select 'Monitor', count(*) from Monitor
select 'Rules', count(*) from Rules
select 'MonitorOverride', count(*) from MonitorOverride
select 'ModuleOverride', count(*) from ModuleOverride
select 'InstanceOverride', count(*) from InstanceOverride

select 'Agent numbers'
select v.IsAgent, v.IsGateway, v.IsManagementServer, count(*) from MTV_healthservice v
group by v.IsAgent, v.IsGateway, v.IsManagementServer
order by v.IsAgent, v.IsGateway, v.IsManagementServer

select 'Computer numbers'
select count(*) from MTV_Computer

select 'Action partitions'
select * from Partitiontables where iscurrent = 1

select 'Table sizes'
SELECT s.object_id, o.type_desc, sum ( used_page_count ) * 8 as SizeKB,
sum(row_count) as [RowCount], object_name ( s.object_id ) AS TableName
FROM sys.dm_db_partition_stats s
inner join sys.objects o on s.object_id = o.object_id
WHERE index_id=0 or index_id=1
GROUP BY s.object_id, o.type_desc
ORDER BY TableName

select 'Performance sample statistics'
select COUNT(*), min(timeadded), max(timeadded)
from PerformanceDataInsertView with(nolock)

select 'Event statistics'
select COUNT(*), min(timeadded), max(timeadded)
from EventInsertView with(nolock)

select 'Event Message statistics'
select
'EventMsgLength',
(len(LocalizedText.LTValue)/100)*100,
(len(LocalizedText.LTValue)/100)*100+100,
COUNT(distinct LocalizedText.LTStringId),
COUNT(*)
from EventInsertView with(nolock)
inner join PublisherMessages with(nolock)
on EventInsertView.PublisherId = PublisherMessages.PublisherId
and EventInsertView.FullNumber = PublisherMessages.MessageId
inner join LocalizedText with(nolock)
on LocalizedText.LTStringId = PublisherMessages.MessageStringId
group by (len(LocalizedText.LTValue)/100)*100
order by (len(LocalizedText.LTValue)/100)*100

select 'Alert statistics'
select
DATEPART(yyyy,TimeAdded) AS 'YEAR',
DATEPART(mm,TimeAdded) AS 'MONTH',
DATEPART(dd,TimeAdded) AS 'DAY',
COUNT(*),
sum(Alert.RepeatCount + 1)
from Alert with(nolock)
group by
DATEPART(yyyy,TimeAdded),
DATEPART(mm,TimeAdded),
DATEPART(dd,TimeAdded)
order by
DATEPART(yyyy,TimeAdded),
DATEPART(mm,TimeAdded),
DATEPART(dd,TimeAdded)

select 'State change statistics'
select
DATEPART(yyyy,TimeAdded) AS 'YEAR',
DATEPART(mm,TimeAdded) AS 'MONTH',
DATEPART(dd,TimeAdded) AS 'DAY',
COUNT(*)
from StateChangeEvent with(nolock)
group by
DATEPART(yyyy,TimeAdded),
DATEPART(mm,TimeAdded),
DATEPART(dd,TimeAdded)
order by
DATEPART(yyyy,TimeAdded),
DATEPART(mm,TimeAdded),
DATEPART(dd,TimeAdded)

select 'Relationship Discovery statistics'
select
DATEPART(yyyy,ecl.LastModified) AS 'YEAR',
DATEPART(mm,ecl.LastModified) AS 'MONTH',
DATEPART(dd,ecl.LastModified) AS 'DAY',
ecl.ChangeType,
COUNT(*),
count(distinct ecl.EntityTransactionLogId)
from EntityChangeLog ecl with(nolock)
where ecl.RelationshipId is not null
group by
DATEPART(yyyy, ecl.LastModified),
DATEPART(mm,ecl.LastModified),
DATEPART(dd,ecl.LastModified),
ecl.ChangeType
order by
DATEPART(yyyy,ecl.LastModified),
DATEPART(mm,ecl.LastModified),
DATEPART(dd,ecl.LastModified),
ecl.ChangeType

select 'Entity Discovery statistics'
select
DATEPART(yyyy,ecl.LastModified) AS 'YEAR',
DATEPART(mm,ecl.LastModified) AS 'MONTH',
DATEPART(dd,ecl.LastModified) AS 'DAY',
ecl.ChangeType,
COUNT(*),
count(distinct ecl.EntityTransactionLogId)
from EntityChangeLog ecl
where ecl.RelationshipId is null
group by
DATEPART(yyyy, ecl.LastModified),
DATEPART(mm,ecl.LastModified),
DATEPART(dd,ecl.LastModified),
ecl.ChangeType
order by
DATEPART(yyyy,ecl.LastModified),
DATEPART(mm,ecl.LastModified),
DATEPART(dd,ecl.LastModified),
ecl.ChangeType

select 'Performance signature statistics'
select 'Data', COUNT(*) from PerformanceSignatureData with(nolock)
select 'History', COUNT(*) from PerformanceSignatureHistory with(nolock)

select 'Instance space statistics'
select 'BME', COUNT(*) from BaseManagedEntity with(nolock)
select 'TME', COUNT(*) from TypedManagedEntity with(nolock)
select 'Relationship', COUNT(*) from Relationship with(nolock)
select 'RecursiveMembership', COUNT(*) from RecursiveMembership with(nolock)
select 'DiscoverySourceToRelationship', COUNT(*) from DiscoverySourceToRelationship with(nolock)
select 'DiscoverySourceToTypedManagedEntity', COUNT(*) from DiscoverySourceToTypedManagedEntity with(nolock)

I hope this helps others as much as it has me.

Good Luck!