Monday, December 3, 2012

Negative GC observed in GC logs

Of late, in our JEE web application tests, we have been observing negative GC times in the JBoss GC logs. After some reading we found that the clocks of the servers were not properly synchronized using the Network Time Protocol (NTP).

Most likely the server's clock had drifted out of sync and NTP kicked in and corrected the time before the GC finished. The values in the first column of the log, which indicate the number of seconds since JVM start at which the GC ran, show the same trend: after an entry at 8175 seconds, the next entry shows 8137 seconds.

We are yet to validate whether this server regularly drifts out of sync by checking the drift file it generates. We learnt that the drift file can be enabled by adding the line driftfile /var/lib/ntp/ntp.drift to the ntp configuration file (see the quick checks below).
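
For reference, a couple of quick checks on the server, assuming a standard ntpd installation (the drift file path matches the driftfile line above):

# check peer status, offset and jitter of the local ntpd
ntpq -p

# the drift file holds ntpd's estimate of the clock's frequency error in parts per million
cat /var/lib/ntp/ntp.drift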



Thursday, November 22, 2012

Debugging a Communications Link failure issue for a long MySQL procedure

A long procedure executed from JBoss against a MySQL instance was failing every 2 hours in the production environment. The reason is detailed below.

Three TCP-related kernel parameters in Linux played a role in this.

cat /proc/sys/net/ipv4/tcp_keepalive_time = 7200 seconds
cat /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 seconds
cat /proc/sys/net/ipv4/tcp_keepalive_probes = 9

Description - The first two parameters are expressed in seconds, and the last is a plain count. This means the keepalive routines wait for two hours (7200 seconds) of idle time before sending the first keepalive probe, and then resend probes every 75 seconds. If no ACK response is received for nine consecutive probes, the connection is marked as broken.
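
As a quick aside, the same values can be read and tuned through sysctl; a rough sketch, assuming root access (the 600-second value is only an example):

# read the current keepalive settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# lower the idle time before the first probe to 10 minutes (runtime change only)
sysctl -w net.ipv4.tcp_keepalive_time=600

# persist the change across reboots
echo "net.ipv4.tcp_keepalive_time = 600" >> /etc/sysctl.conf
sysctl -p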


Conclusion - We need to know the default settings in our environment, and of course the procedure itself requires tuning :)

Friday, November 9, 2012

MySQL Query Profiling

Profiling is enabled per session. When a session ends, its profiling information is lost.

mysql> SELECT @@profiling;
+-------------+
| @@profiling |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)

mysql> SET profiling = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE T1 (id INT);
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW PROFILES;
+----------+----------+--------------------------+
| Query_ID | Duration | Query                    |
+----------+----------+--------------------------+
|        0 | 0.000088 | SET PROFILING = 1        |
|        1 | 0.000136 | DROP TABLE IF EXISTS t1  |
|        2 | 0.011947 | CREATE TABLE t1 (id INT) |
+----------+----------+--------------------------+
3 rows in set (0.00 sec)

mysql> SHOW PROFILE;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| checking permissions | 0.000040 |
| creating table       | 0.000056 |
| After create         | 0.011363 |
| query end            | 0.000375 |
| freeing items        | 0.000089 |
| logging slow query   | 0.000019 |
| cleaning up          | 0.000005 |
+----------------------+----------+
7 rows in set (0.00 sec)

mysql> SHOW PROFILE FOR QUERY 1;
+--------------------+----------+
| Status             | Duration |
+--------------------+----------+
| query end          | 0.000107 |
| freeing items      | 0.000008 |
| logging slow query | 0.000015 |
| cleaning up        | 0.000006 |
+--------------------+----------+
4 rows in set (0.00 sec)

Refer - http://dev.mysql.com/doc/refman/5.0/en/show-profile.html

Tuesday, November 6, 2012

Linux - Drop caches.

Writing to /proc/sys/vm/drop_caches causes the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
  • echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
  • echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
  • echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed.
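
A sketch of the sequence (needs root; it is safe for data, but the caches will have to warm up again afterwards):

# flush dirty pages to disk first so that as much as possible becomes clean and freeable
sync
# then drop pagecache, dentries and inodes
echo 3 > /proc/sys/vm/drop_caches
# confirm the effect
free -m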

This tunable was added in 2.6.16.

Referred - http://www.linuxinsight.com/proc_sys_vm_drop_caches.html

Sunday, October 28, 2012

Regarding favicon.ico and related 404.

Recently, requests for /favicon.ico were showing up as many 404s in the web server's access logs. We found that ga.js accesses this path and was not finding the icon in the root of the site, where it ideally should be.

Performance tip - 
Make sure there’s a favicon.ico file in the root of the site, i.e. /favicon.ico. Browsers will automatically request it, even if the icon isn’t mentioned in the HTML at all. If you don’t have a /favicon.ico, this will result in a lot of 404s, draining your server’s bandwidth.
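
A quick way to gauge the scale of the problem is to count the favicon 404s in the access log; a sketch assuming an Apache-style combined log where the status code is the ninth field (adjust path and field for your server):

grep "favicon.ico" /var/log/httpd/access_log | awk '$9 == 404' | wc -l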

Refer -
http://programmers.stackexchange.com/questions/46716/what-should-every-programmer-know-about-web-development

Thursday, October 25, 2012

Persistent Connections and their Timeouts for Browsers IE and FF


Internet Explorer
By default, versions 6 and 7 use 2 persistent connections per host, while version 8 uses 6.

Persistent connections time out after 60 seconds of inactivity which is changeable via the Windows Registry.

Mozilla Firefox
The number of simultaneous connections can be customized (per server, per proxy, or in total).
For example, in FF 13.0.1, network.http.max-persistent-connections-per-server = 8.

Persistent connections time out after 115 seconds (1.92 minutes) of inactivity, which can be changed via the network.http.keep-alive.timeout setting in about:config.

Thursday, October 18, 2012

Ways to reduce MySQL Master-Slave Replication Lag

Batch jobs are often written to do the following; in MySQL, some of these can cause master-slave replication lag -

1. Need to update a lot of rows
2. Need to select rows for modification
3. Need to insert and select
4. Need to execute batch jobs on master
Some of the ways to overcome this are detailed below.

1. Need to update a lot of rows

Solution -
MySQL replication is single threaded, so we should avoid any long-running write query that could clog the replication stream. One way to do this is to chop the query.

Example -
If we need to update 10,000 rows, then instead of doing it in a single query we can do it in, say, 10 small, short queries (see the sketch below).
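
A rough sketch of such chopping, driven from a shell batch job (the posts/spam table and column come from the example further down; the mydb database and the chunk size are made up for illustration):

# update at most 10,000 rows per statement and loop until nothing is left to change;
# the AND spam=0 predicate is what lets the loop terminate
while true; do
  ROWS=$(mysql -N -e "UPDATE posts SET spam=1 WHERE body LIKE '%cheap rolex%' AND spam=0 LIMIT 10000; SELECT ROW_COUNT();" mydb)
  if [ "${ROWS:-0}" -eq 0 ]; then break; fi
  sleep 1   # give the replication thread some breathing room between chunks
done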

2. Need to select rows for modification

If considerable effort is needed to select the rows for modification, spread it out and use separate SELECT and UPDATE queries. That way the slave only has to run the UPDATE and is not made to do more work than it needs to.

Example - instead of:

UPDATE posts SET spam=1 WHERE body LIKE '%cheap rolex%';

use:

SELECT id FROM posts WHERE body LIKE '%cheap rolex%';
UPDATE posts SET spam=1 WHERE id IN (list of ids);

3. Need to insert and select

Solution -

The best solution is to avoid sending INSERT ... SELECT through replication, for reasons such as long query time and wasted execution on the slave.

The alternative is to pipe the data through the application, which works well in many cases and is quite easy: it is trivial to write a function that takes a SELECT query and a target table, stores the result set into that table, and is then reused in the application wherever this functionality is needed.

4. Need to execute batch jobs on master

We should not overload replication. Quite typically we see replication lagging while batch jobs are running: they can load the master significantly during their run time and make it impossible for the slave to push the same load through its single thread.

Solution -

i) Space out and slow down the batch jobs (for example by adding sleep calls) to ensure there is enough breathing room for the replication thread.

ii) Use controlled execution of the batch job: check the slave lag every so often and pause if it becomes too large. This is a slightly more complicated approach, but it saves us from hand-tuning sleep intervals to keep progress fast enough while still keeping replication from lagging (a sketch follows below).
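
A sketch of approach (ii) in shell (the slave host, credentials and the 30-second threshold are illustrative):

# block until the slave's Seconds_Behind_Master drops below a threshold
wait_for_slave() {
  while true; do
    LAG=$(mysql -h slave-host -N -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
    if [ "$LAG" != "NULL" ] && [ "$LAG" -lt 30 ]; then break; fi
    sleep 10
  done
}

# call wait_for_slave between chunks of the batch job
wait_for_slave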


Refer -
http://www.mysqlperformanceblog.com/2008/09/22/fighting-mysql-replication-lag/
http://www.mysqlperformanceblog.com/2007/10/12/managing-slave-lag-with-mysql-replication/

Monday, October 1, 2012

File Fragmentation & MySQL Performance


Problem Description
---------------------------------------------------
For several months we had been observing a specific SELECT query that was tuned yet running slowly against a couple of tables in MySQL 5.1.62 (InnoDB engine) in our production database. The queries continued to be slow after the database was upgraded to MySQL 5.5.24.

One of the tables has 1.4 million rows and is 184 MB in size. The SELECT query was tuned, using a ref join and scanning only ~150 rows, yet it was taking 220 milliseconds. Since we ran multiple SELECT queries in a loop, the transaction time inflated to more than 30 seconds.

We ran OPTIMIZE TABLE on the table and the .ibd file did shrink by around 20-25 MB, but the queries continued to be slow. We then dumped and restored the table, and to our surprise the queries ran very fast; the performance gain was 10x.


The root cause was File fragmentation.

Lessons learnt -
1) A core difference between dump/restore and OPTIMIZE TABLE.
2) The filefrag command.

1) Dump and restore causes the table to be rebuilt entirely from scratch, in the 5.5 row format. It also gets rid of any fragmentation that may have existed on the filesystem (since we weren't doing a simple cp or mv of the existing files, as effectively happens with ALTER/OPTIMIZE TABLE), as well as on any logical volume and, finally, on the actual storage device.
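
For the record, the dump and restore itself was nothing exotic; roughly the following, with database, table and file names as placeholders:

# dump a single table, then reload it so InnoDB rebuilds the .ibd file from scratch
mysqldump --single-transaction mydb mytable > mytable.sql
# the dump includes a DROP TABLE statement by default, so reloading recreates the table
mysql mydb < mytable.sql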

2) filefrag (Linux)

Following was the output generated for two tables
> filefrag foo.ibd
Output > foo.ibd: 83 extents found, perfection would be 2 extents

> filefrag poo.ibd
Output > poo.ibd: 131 extents found, perfection would be 4 extents


Friday, September 7, 2012

How to Read GC Logs in Java ?

Java provides a built-in garbage collection logging facility through which you can quickly and easily troubleshoot Java applications. It helps in understanding an application's memory usage, heap size and GC behaviour.
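
Logs of this shape are produced once GC logging is switched on through JVM options; a minimal sketch for the HotSpot JVMs of that era (the jar name and log file are examples):

# timestamped one-line GC entries, as in the samples below
java -Xloggc:gc.log -jar myapp.jar

# add a per-generation breakdown for each collection
java -Xloggc:gc.log -XX:+PrintGCDetails -jar myapp.jar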

Following are the sample GC Logs

    12345.094: [GC 695486K->557348K(806720K), 0.0191830 secs]
    167741.549: [Full GC 695054K->51454K(806720K), 1.8452160 secs]

Information provided by Sample 1 :

12345.094: [GC 695486K->557348K(806720K), 0.0191830 secs]

    12345.094 – Timestamp: seconds since the JVM started
    GC – A minor (partial) GC occurred
    695486K – Heap used (in KB) before the GC ran
    557348K – Heap used (in KB) after the GC ran
    806720K – Total heap size (in KB)
    0.0191830 secs – Time taken by the GC run

Information provided by Sample 2 :

167741.549: [Full GC 695054K->51454K(806720K), 1.8452160 secs]

    167741.549 – Timestamp: seconds since the JVM started
    Full GC – A full GC occurred
    695054K – Heap used (in KB) before the GC ran
    51454K – Heap used (in KB) after the GC ran
    806720K – Total heap size (in KB)
    1.8452160 secs – Time taken by the GC run

Friday, August 17, 2012

A useful service that can help you analyze any site.

Statsgram.com is a useful service that can help you analyze any site.

I particularly liked the Summary and the "Search Queries" section.

Thursday, August 16, 2012

How to set the max heap size in C#?

I had this basic question - how do you set the max heap size in C#? - which got resolved by a link shared by my colleague.

"The .Net heap will grow, as needed, to consume all available memory. There's
no need to manually adjust it.

On x86, you'll be limited about 1.5gb, regardless of how much memory you have
on your system.

On x64 or IA64, the limits are much, much higher.

If you're doing "Big" calculations, and memory is a concern, just run under
x64 and make sure your code is set to "Any CPU". This does mean you need x64
hardware, and a 64-bit O/S installed (XP x64, Vista x64, Windows 2003 x64)."


Sunday, August 12, 2012

Out of Memory - Permanent Generation Space

The permanent generation is the area of the heap that holds the reflective data of the virtual machine itself, such as class and method objects; it also holds interned strings.

If an application loads "too many" classes or interns "too many" strings, it may abort with an OutOfMemoryError. The specific error is: Exception in thread XXXX java.lang.OutOfMemoryError: PermGen space.

A way to find out what is occupying the PermGen space is the simple command jmap -permstat.
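
For example (the pid is hypothetical, and -permstat can take a while on a large heap):

# per-class-loader statistics for the permanent generation
jmap -permstat 12345

# if the contents are legitimate, the space itself can be raised via JVM options
java -XX:PermSize=256m -XX:MaxPermSize=512m -jar myapp.jar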

Interned strings can occupy space due to wrong usage of intern() in the code. You can find more details about this here - http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html

Refer URL for more details -
http://www.oracle.com/technetwork/java/javase/tooldescr-136044.html#gblmm

Wednesday, July 25, 2012

A handy utility to plot vmstat logs

Came across a handy utility to plot vmstat logs - vmplot.sh
Refer -
http://www.bigdbahead.com/?p=302
http://ronaldbradford.com/blog/extending-vmplot-2009-03-31/

The graphs it produces are the following:
  • CPU: graphs idle, user, sys and wait time
  • CS: graphs the context switches/s
  • IN: graphs the interrupts/s
  • IO: graphs the block read and written to disk per second
  • Memory: graphs the memory usage (swapped, free, buffers, cache)
  • Procs: graphs the number of running and blocked processes
  • Swap: graphs swap in and swap out activity
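
To have something to plot, first capture vmstat output to a file and then feed that file to vmplot.sh as described in the links above (the 5-second interval and one-hour window are arbitrary):

# sample every 5 seconds for an hour (720 samples)
vmstat 5 720 > vmstat_$(hostname).log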

Wednesday, June 13, 2012

Separation of concerns

I came across a scenario where the login functionality lived in the home page itself. It worked fine for 3+ years in production, until an outage caused by a code issue took the home page down as well.

Now the login page will be separated from the home page, and it will encourage users to bookmark it. This way the bandwidth requirement should also come down, as the home page will no longer be accessed by regular users over time.

Friday, June 8, 2012

Issue with generating random numbers on Linux using SecureRandom

I came across a scenario where, under load, several application servers had threads build up and finally crashed. Thread dumps revealed that many threads were stuck on the following stack trace:

"ajp-10.10.48.19-8009-238" daemon prio=10 tid=0x00000000537f7000 nid=0x51c waiting for monitor entry [0x000000004547e000..0x000000004547fc00]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:308)
    - waiting to lock <0x00002aaaf2868f98> (a java.io.BufferedInputStream)
    at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
    at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
    at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
    at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
    at java.security.SecureRandom.generateSeed(SecureRandom.java:495)

We had similar symptoms to NIA. Under load, the API to generate a secure random number would take time. Random-number seeding draws on keyboard and mouse activity, which servers lack, so the API takes time to generate a distinct random number. The API completes quickly when there is no load, but when many threads request random numbers it slows down.

To stop the API from blocking while it gathers a perfect random seed and to make it complete quickly, we set the following Java system property in the Java options.

-Djava.security.egd=file:/dev/./urandom

The 90th percentile transaction response time went down from 15.1 sec to 1.6 sec by just setting this property for a load of 50 concurrent users.

Here is the complete set of JAVA Parameters:

-Djboss.jvmroute.name=f2m56g2 -Dprogram.name=run.sh -server -javaagent:/AppData/AppDynamics/AppServerAgent/javaagent.jar=uniqueID=f2m56 -Xms4096m -Xmx4096m -XX:NewRatio=3 -Xss256k -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+UseAdaptiveSizePolicy -Dsun.rmi.dgc.client.gcInterval=3600000 -Djboss.partition.name=f2m56 -Dsun.rmi.dgc.server.gcInterval=3600000 -XX:+DisableExplicitGC -XX:PermSize=384m -XX:MaxPermSize=384m -XX:+UseTLAB -Xloggc:ace2_gc.log -Dorg.jboss.logging.Log4jService.catchSystemOut=false -Djava.util.Arrays.useLegacyMergeSort=true -Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true -Djava.library.path=/jboss-4.2.3.GA/bin/native -Djava.endorsed.dirs=/jboss-4.2.3.GA/lib/endorsed -classpath /jboss-4.2.3.GA/bin/run.jar:/usr/java/jdk1.7.0_55/lib/tools.jar org.jboss.Main -Dorg.jboss.logging.Log4jService.catchSystemOut=false -Dcfg.system.property=/project/GEMSSmbImpl/config/env/common,/project/GEMSSmbImpl/config/env/GEMSCommonEnv.xml,/project/GEMSSmbImpl/config/env/GEMSOnlineEnv.xml,/project/GEMSSmbImpl/config/env/GEMSOnlineCfg.xml,/project/GEMSSmbImpl/config/Gems.properties


This is our env.:
OS version: Red Hat Enterprise Linux Server release 6.4 (Santiago)
JBoss version: jboss-4.2.3.GA
JDK version: jdk1.7.0_55

At the Linux prompt you can list the device /dev/urandom. You may have to find an equivalent urandom path if your Linux version is different.

NOTE
One funny thing: we had to use file:/dev/./urandom. It did not work when we simply pointed the property at /dev/random.


Java was waiting on input to the secure number generator code. On Linux, /dev/random is a “blocking” number generator meaning if it doesn’t have enough random data to provide, it will simply wait until it does. We can get more background information on /dev/random here - http://en.wikipedia.org/wiki//dev/random. Keyboard and mouse input as well as disk activity can generate the randomness, or entropy, needed but perhaps not fast enough for particular applications. A lack of random data would force the JVM to wait, for eternity if necessary.
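
Two quick checks that help confirm this on a server (standard Linux paths):

# how much entropy the kernel currently has available for /dev/random
cat /proc/sys/kernel/random/entropy_avail

# confirm the non-blocking device exists at the path used in java.security.egd
ls -l /dev/urandom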

Reference links -
http://tech.top21.de/techblog/20100317-why-java-s-random-number-generation-might-block-my-application-under-linux.html;jsessionid=B76AA3FB494E7E50B732CEB498B61668

http://www.ghidinelli.com/2011/01/11/cold-fusion-slow-start-messagebrokerservlet

Thursday, April 26, 2012

Handy Jmap command options

Heap Space

jmap -heap pid
- Gives heap details. Look at the section named "PS Old Generation"; if its usage is above 90%, run the next two commands, in that order.

jmap -histo pid
For each Java class, the number of objects, memory size in bytes, and fully qualified class name are printed. VM-internal class names are printed with a '*' prefix. This helps to find out what is occupying the heap.
If the live suboption is specified, only live objects are counted.
Note - jmap -histo:live triggers a full GC, unlike jmap -histo.

jmap -dump:live,format=b,file=<filename> pid
jmap -dump:format=b,file=<filename> pid

The dump can be as large as the configured heap (e.g. 3 GB), so ensure that much disk space is available. One way to view the dump is with an IBM heap dump analyzer, which shows the top leak suspects. It has been observed that the memory required to open the dump is usually at least twice the size of the dump.
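
A typical triage sequence, with a hypothetical pid and dump file name:

# 1. check old generation usage for the JVM with pid 12345
jmap -heap 12345

# 2. if old gen is above ~90%, see which classes dominate (note: :live forces a full GC)
jmap -histo:live 12345 | head -30

# 3. take a binary dump of live objects for offline analysis
jmap -dump:live,format=b,file=/tmp/heap_12345.hprof 12345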

Permanent Generation Space

jmap -permstat pid

Thursday, April 12, 2012

Anomaly Detection - Fault Tolerance - Anticipation Patterns

Sharing the four cornerstones of anomaly detection, fault tolerance and anticipation patterns, which I came across in one of the many presentations floating around on the web.

1. Anticipation - Knowing What To Expect

- During choice of technologies
- During design of monitoring and metrics
- During design of architecture

2. Monitoring - Knowing What To Look For
3. Response - Knowing What To Do
4. Learning - Knowing What Has Happened

Areas of redundancy -

Spatial (server, network, process)
Temporal (checkpoint, “rollback”)
Informational (data in N locations)

Wednesday, April 4, 2012

Detecting .NET application memory leaks

I came across an article that talks about a utility named DebugDiag that helps to take a heap dump of a .NET application and view the dump in an HTML format.

www.dotnetfunda.com/articles/article625-best-practices-no-5-detecting-net-application-memory-leaks.aspx

Also found a way to test a desktop-based .NET application using AutoIt, turning on Perfmon logs to monitor key performance counters such as CPU, memory, process private bytes, and .NET gen0, gen1 and gen2 space and collections.

Wednesday, February 29, 2012

Why invest in a Performance Engineering team?

Connecting Performance Engineering to Business value
  1. Support business objectives (growth), drive revenue
  2. Manage risk: scalability, stability (fear of the unknown)
  3. Increase efficiency: support the same workload with fewer resources and less spending
  4. Reduce operational costs and production outages

Monday, January 16, 2012

A good article on Data Architecture.

I have been working on tuning two slow reports and learnt that data architecture needs great attention, especially at project inception. Searching for more on it, I came across a link which says that data architecture spans four prime areas: Organize, Store, Access and Move. In my scenario, some reports are slow due to too little focus on Store (for example, partitioning was not done), and others due to Organize, as data growth was not considered and hence the design did not allow for scalability/performance.

http://www.information-management.com/issues/20020301/4801-1.html

Sunday, January 8, 2012

Tips to perform RCA of a production crash

1) Concentrate

Concentration arises when the intellect directs the mind to the present action without meandering into unproductive channels of the past and future.

2) Be consistent

Consistency is when the intellect directs all thoughts and activities towards a goal.

3) Do not assume, expect the unexpected (Keep your mind open)


For example, in a bulk insert/update scenario, one would expect the developer to have committed at regular intervals, not after every query or only at the end of one big transaction.

4) Analyze log

Check different logs for any traces, clues, etc. Often we may not be able to find the root cause from a single log, but after analyzing several logs, thanks to repeatability, we may spot a trend which helps find the root cause.

5) Some issues have patterns and trends and some do not - monitor all layers, tiers, etc., and correlate. Generally it is easier to find the root cause when there are trends.

6) Sometimes it makes sense not to change anything. If you do change something, keep a history of the changes for analysis - we may need to revert them.

7) Talk to the developers - they know, or have a sense of, what the problem or its cause is.

8) Failures are rich in learning - do a post-mortem or root cause analysis, and work out how to prevent it the next time. Publish a post-mortem document identifying what happened, how, when, and what can be done to prevent similar problems in the future.

9) Keep calm and be human (Be blameless)
There are two ways of reacting when an outage is due to a mistake made by a human. One is obvious :)

The other - don't get angry at people over outages. Otherwise people may hide problems, stop communicating and avoid transparency, and small problems get ignored and turn into big ones.

If people are afraid to speak up, you extend outages!

10) Share it - Practice deliberate sharing. If it matters, share it, you never know who will benefit…and it could be you.