Sakai Performance Issues

Tracking known Sakai performance issues that we need to address.

sakai_2-5-x

sakai_2-4-x

  • Lance's response to "UC Berkeley in crisis"
    • There has been a lot of excellent input on this issue.  I would add the following comments:
      
      1) Do not discount the possibility of SAK-8932 as Stephen Marquard suggested.  This issue is not limited to Chat but
      can be triggered by any JSF application AFAIK.  The stuck threads could lead to the kind of request backlogs you are
      reporting.  We have only seen this crop up once or twice, but it does seem to be load related so you may be seeing it
      in your environment.
      
      2) Hardware is cheap - Since we have upgraded to eight servers with 10GB heaps (i.e. total 80GB heap), Sakai is
      behaving *MUCH* better under load.  Our hardware change included moving from a 32-bit to a 64-bit OS, cutting the
      number of app servers in half (i.e. 16 --> 8), keeping the total number of CPUs in the cluster at 32, and going
      from 32GB total heap to 80GB (see the sketch after this message).
      
      3) Are you seeing any OutOfMemoryErrors?  Before our 64-bit upgrade, we were seeing 10-15 of these a day.  Since
      the upgrade, I have not seen a single OOM error.
      
      4) Turning off quotas does significantly reduce the amount of XML DOM parsing you will do, but it was not a major
      contributing factor to our stability.
      
      Let us know what we can do to help... L
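
      For a rough sense of what the heap described in point 2 looks like in practice, here is a minimal sketch of the
      JVM options involved, assuming a standard Tomcat setup; the email does not quote the actual flags, so the values
      below are illustrative only:

      # Illustrative Tomcat JVM options (assumed, not quoted from the email)
      # 10GB heap per app server x 8 servers = 80GB total heap; requires a 64-bit JVM and OS
      JAVA_OPTS="-server -Xms10g -Xmx10g"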
      
  • Email from Lance Speelmon
    • https://oncourse.iu.edu/access/wiki/site/3001b886-1069-4fb7-00d5-8db4b3a85f74/home.html
    • Adi,
      
      Let me see if I can outline the changes:
      
      1) DBCP settings we have been running for 2+ years:
      minSize=10
      initialSize=10
      maxSize=50
      
      2) When we started seeing DBCP having problems establishing new database connections, we switched to:
      minSize=50
      initialSize=50
      maxSize=50
      * These settings served us pretty well until we saw the 2x load increase the first week of classes.
      
      3) Once the load really hit we tried:
      minSize=150
      initialSize=150
      maxSize=150
      * We were still seeing errors with creating new database connections and DBCP deadlocks.
      
      4) Our current settings after switching to c3p0:
      minSize=150
      initialSize=150
      maxSize=150
      * We still saw connection errors, but c3p0 was able to cope without any deadlocking.
      
      5) Now that we think we have resolved our Oracle connection issues, we are considering moving to the following settings for c3p0:
      minSize=10
      initialSize=10
      maxSize=150
      * The changes that we think resolved the Oracle connection issues were increasing the number of dispatchers and disabling automatic memory management.
      
      Thanks, L
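
      For reference, pool settings like the ones above are expressed in sakai.properties using the
      @javax.sql.BaseDataSource syntax quoted in the emails below.  The following is only a sketch of step 2, assuming
      Lance's minSize/maxSize correspond to DBCP's minIdle/maxActive; the c3p0 property names differ:

      # Sketch of the step 2 (DBCP) pool in sakai.properties -- illustrative only
      initialSize@javax.sql.BaseDataSource=50
      minIdle@javax.sql.BaseDataSource=50
      maxIdle@javax.sql.BaseDataSource=50
      maxActive@javax.sql.BaseDataSource=50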
      
      Do you have minIdle and maxIdle set? And does maxIdle = maxActive? That will
      ensure you don't create new db connections and will help you scale much
      better.
      
      We have 8 appservers and use:
      
      minIdle@javax.sql.BaseDataSource=1
      maxIdle@javax.sql.BaseDataSource=14
      initialSize@javax.sql.BaseDataSource=15
      maxActive@javax.sql.BaseDataSource=14
      
      With 400 requests per second peak, I don't see why you would
      need 2400 db pool connections -- maybe 400 * 2 for safety, but you are just
      eating PGA unnecessarily with all those connections, and that memory could be
      used for SGA instead (we reduced our PGA from 512m to 256m and haven't seen
      problems).
      
      Adi
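
      To put those numbers in perspective, the cluster-wide pool Adi describes works out to

      8 app servers x 14 maxActive = 112 pooled connections for a ~400 request/second peak

      versus the 2400 connections implied by 150 connections per server across a 16-node cluster (presumably the
      pre-consolidation size mentioned in the first email above), most of which would sit idle while still consuming
      Oracle PGA memory.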
      
  • Email exchange with R. P. Aditya : aditya@grot.org
    • On Fri, Aug 31, 2007 at 11:50:34AM -0700, Thomas Amsler wrote:
      > > Are the 15 connections in the DBCP connection pool your max setting? I think 
      > > the default is max=50 out of the box.
      
      on our 8 appservers, we use:
      
      minIdle@javax.sql.BaseDataSource=1
      maxIdle@javax.sql.BaseDataSource=14
      initialSize@javax.sql.BaseDataSource=15
      maxActive@javax.sql.BaseDataSource=14
      
      and in typical use, even at peak, we only see 2-3 active sessions from the Oracle side
      
      The most important thing for Oracle is that maxIdle = maxActive so that the
      pool connections are never dropped or recycled since setting up new
      connections is terribly expensive...
      
      Adi
      
    • Adi,
      
      Would you mind sharing your Oracle memory settings?  We are currently running with:
      db_cache_size = 4096M (from 5120M)
      shared_pool_size = 3072M (from 4096M)
      java_pool_size = 250M (no change)
      large_pool_size= 2048M (from 4096M)
      sga_max_size = 20480M (from 24576M)
      
      Thanks, L 
      
    • Hi Lance,
      
      We are using automatic shared memory management in Oracle. Based on your settings and ours, I think the key points are as follows:
      1. Your shared_pool is too large.  The Sakai application code does not need such a large shared pool; 1 GB is a good starting point (unless you have other applications in the same database).
      2. You can set db_cache_size much higher.  We have a total SGA of 6560M, of which 5872M is used for the buffer cache (Oracle assigned it automatically).
      3. 256M of PGA is enough based on our settings.
      4. If you can set sga_max_size = 20480M (or even higher, as yours was before), try to use AMM and set sga_target to at least 18 GB (see the sketch after this message).
      
      The following are our parameter settings:
      sga_max_size =6560M
      sga_target=6560M
      pga_aggregate_target=256M
      
      The following are automatically generated by Oracle based on our target:
      Shared Pool     624M
      Buffer Cache     5872M
      Large Pool     16M
      Java Pool     32M
      Other     16M
      
      Luke has created a page with the parameters as a reference:
      
      http://confluence.sakaiproject.org/confluence/display/ENC/Oracle+Admini
      
      All the parameters can be seen below.
      
      Thanks,
      Drew Zhu
      Oracle DBA
      ITCS, University of Michigan 
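
      Applying Drew's suggestion in point 4 to Lance's instance, the parameter file would look roughly like the
      following.  The values are a sketch inferred from the numbers quoted above, not a tested configuration:

      # Sketch only -- values inferred from Drew's recommendations above
      sga_max_size=20480M
      sga_target=18432M           # "at least 18 gigs"; lets Oracle size the individual pools automatically
      pga_aggregate_target=256M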
      
      What is your "cursor_sharing" parameter set to? Setting it to FORCE or SIMILAR will force the sharing of similar SQL statements and may help in reducing the shared_pool_size.  We use FORCE, as you can see in the parameter file.  Also, if you are running more tools than we use, the shared pool may need to be larger.
      
      Thanks,
      Drew 
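
      In parameter-file terms the setting Drew describes is a single line.  FORCE substitutes system-generated bind
      variables for literals, so SQL statements that differ only in their literals share one cursor, which is what
      reduces the pressure on the shared pool:

      # Drew's setting, as described above
      cursor_sharing=FORCE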
      
  • SAK-9860 : Excessive db queries generated from Site Info / user service
    • From Ian:
      Just committed a fix against SAK-9860.
      
      It's not a total fix, but you should be able to patch 2.4.x (once fully tested), and the profiler is saying the 
      number of queries for a single request is now 1 rather than 4 the first time per user and then 0 after that.
      
      Needs testing though, and only eliminates the EID/ID SQL.
      
  • SAK-11279 : Spurious presence events
    • From Stephen:
      Hi all,
      
      If you're running 2.4.0 or 2-4-x in production with presence enabled, you will probably want to apply the fix to presence/courier from:
      
      http://jira.sakaiproject.org/jira/browse/SAK-11279
      
      This is a bug that logs 2 presence events every time a presence refresh is made (every 30s per user). Fixing this reduced the volume of presence events in our production system by a factor of 10 or more.
      
      Regards
      Stephen