IRC Chat 12162009

 

Irc 12162009 chat with rSmart 

 

#ucdsakai

09:29

INFO

Channel view for “#ucdsakai� opened.

09:29

=-=

User mode for smavocet is now +i

09:29

-->|

YOU (smavocet) have joined #ucdsakai

09:30

<kdalex>

yes, they look okay to me too.

09:31

-->|

markpankow (n=chatzill@72.44.192.164) has joined #ucdsakai

09:31

<mwenk>

If its the problem we had, and IIRC the problem was the cost based optimizer wasn't using the index

09:31

<mwenk>

because the stats on the table were whacked

09:32

<kdalex>

yes, we should have Mark look at that as well as give us the load average on the db machine

09:32

-->|

tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai

09:33

|<--

tamsler has left freenode (Remote closed the connection)

09:33

-->|

tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai

09:34

<tamsler>

sorry for the dealay

09:34

<tamsler>

I just replied to John's email

09:34

<tamsler>

here is the reply:

09:34

<tamsler>

Assignment definitely makes calls to CM. We added the OSIVF to that

09:34

<tamsler>

tool. I think that's one of the tools that you guys have to retrofit

09:35

<tamsler>

with osiv.properties

09:35

<mwenk>

Isn't assignments under OSIV now?

09:35

<kdalex>

how many other places did we do this?

09:35

<tamsler>

checking ..

09:36

<tamsler>

I am looking at https://svn.ucdavis.edu/svn/ucd-sakai/patches/branches/sakai_2-5-x-prod/patch-list.txt

09:36

<mwenk>

web ?

09:36

<mwenk>

it uses section awareness?

09:36

<mwenk>

wow

09:37

<tamsler>

roster, assignment, msgcntr, sections, chat, mailtool, web, gradebook, sections

09:38

<tamsler>

is somebody analyzing the catalina.out w.r.t. osivf errors

09:39

<tamsler>

I still don't have access to their system

09:39

<tamsler>

because 2.6 may have other tools that use cm

09:39

<tamsler>

like OSP etc

09:40

<mwenk>

I can copy those somewhere you can look at em

09:40

<mwenk>

lemme do that now

09:40

-->|

Prabhu (n=prabhuem@psl-87.ucdavis.edu) has joined #ucdsakai

09:41

<tamsler>

brb

09:41

<Prabhu>

Mark, When was the last table switch(ta/tb) ran. And was the successful?

09:42

<kdalex>

John Bush...are they any other URL's that are failing?

09:42

<kdalex>

or hanging in the load test?

09:42

<jbush>

that was the worst

09:43

<jbush>

once sec, I'm not sure I saved the report

09:43

<markpankow>

I just sent the group the explain plans for the queries Prabhu sent. Full table scans on coursemanagement_section.

09:44

<smavocet>

Statistics problem like we had, Prabhu?

09:45

<kdalex>

any idea from anyone why we didn't see this on last weeks load test?

09:46

<Prabhu>

i need to know when and how the last table switch script ran

09:46

<Prabhu>

Mark?

09:46

<mwenk>

Thomas: rsmart catalina files copied to sakai@caje:/ucd/opt/sakai/rsmart_logs

09:50

-->|

kzinti34 (n=chatzill@psl-194.ucdavis.edu) has joined #ucdsakai

09:50

<tamsler>

thanx mike

09:52

<kdalex>

Prabhu, what will the time of the table switch tell you...about stats? Can we see if the stats are in place from here?

09:53

<tamsler>

looks like auto.ddl is enabled not that this is a problem per see but we never did that

09:53

<kdalex>

number of oracle connections is dropping but performance is still slow

09:53

<jbush>

its really bad

09:54

<jbush>

I'm not running any tests over here

09:55

<mwenk>

Probably the queries from the test havent' returned yet

09:55

<mwenk>

We saw that too

09:55

<mwenk>

sessions would hang until we killed the app

09:55

<smavocet>

Is Mark still on IRC? Prabhu is looking for him.

09:55

<mwenk>

sessions as in DB sessions

09:55

<rsmartmike>

thomas, i have copied your pubkey to the server. you will be able to log in now. the other key was messed up and had the incorrect permissions...so that's all fixed.

09:56

<tamsler>

I wonder if the assignment tables are ok in the internal DB

09:56

<tamsler>

lots of errors in catalina.out

09:57

<rsmartmike>

sakai_external is the proper schema name...

09:58

<Prabhu>

OK

09:58

<rsmartmike>

we use 'sakai' and 'sakai_external'

09:58

<Prabhu>

thanks

09:59

<markpankow>

I just replied to your email prabhu. Sorry, I was on the phone with Dave.

09:59

<markpankow>

sakaiext in the pilot. Sakai_external here.

09:59

<tamsler>

I am still not able to ssh tpamsler@smartsite-test.rsmart.com

09:59

<mwenk>

try tamsler@

10:00

<kdalex>

i tried running the query you sent, Prabhu, from TOAD but I get

10:00

<kdalex>

ORA-02404: specified plan table not found

10:00

<jbush>

do we want to recycle the app servers ?

10:00

<rsmartmike>

yes, your userid is 'tamsler'

10:01

<Prabhu>

Kirk, we will not be able to run these quries from our end...we need dba priv on the server itself

10:02

<rsmartmike>

oracle server load is still high--do we know why if there are no load tests running?

10:03

<Prabhu>

Mark, I have sent drop stats statements. Please run them and repeat the autotrace

10:03

<kdalex>

I show valid stats on course_management_ta with last date = 12/16/2009 5:02:54 AM

10:04

<mwenk>

jbush, your assignment check relied on an preexisting assignment being there?

10:04

<mwenk>

or did you create an assignment for the test?

10:04

<kdalex>

db sessions have just dropped to 3 from 400

10:05

<smavocet>

Prabhu sent him delete stats from various tables.

10:05

<mwenk>

did you guys cycle the app servers?

10:05

<smavocet>

sent Mark

10:05

<rsmartmike>

oracle server load plummeted

10:05

<rsmartmike>

have not cycled tomcat...i can do that now, just give the word.

10:06

<kdalex>

dont think they recycled the servers, I didn't loose my session

10:06

<mwenk>

cool

10:06

<mwenk>

did mark kill the stats then?

10:06

<smavocet>

So once Mark runs the delete stats from several tables, we'll start the load test again?

10:07

<jbush>

much better now

10:07

<rsmartmike>

so, cycle tomcat or no/

10:07

<Prabhu>

:) Lets hear from Mark first

10:07

<jbush>

I don't think its necessary

10:07

<mwenk>

I think the assignment NPEs are because whatever assignment it was looking for doesn't exist

10:07

<mwenk>

probably because it was created on the old data but not the new data

10:08

<Prabhu>

we need to look at the trace and explain plans, to make sure there are no full scans

10:08

<jbush>

right

10:08

<jbush>

we should have corey re-record the test

10:08

<jbush>

now that the server are responsive again he can do that

10:08

<mwenk>

I can't believe grinder cant do thinks like clicking etc

10:08

<jbush>

I was going to have him do test8, once with chat and once without chat

10:08

<mwenk>

reminds me of jmeter

10:09

<jbush>

well thats why we use selenium

10:09

<mwenk>

silk is like selenium

10:09

<mwenk>

in that way anyways

10:09

<mwenk>

well it can do both

10:09

<kdalex>

so did mark do anything to the db to get it to clear? any word on dropping the stats

10:10

<tamsler>

same with tamsler it just prompts for pw instead pass phrase

10:11

<smavocet>

Mark just sent something in email.

10:11

<Prabhu>

i got explain plans from mark...they look great

10:11

<smavocet>

Explain plan

10:12

<mwenk>

owa is being evil to me

10:12

<smavocet>

so all stats have been removed, Prabhu?

10:12

<Prabhu>

yes

10:12

<smavocet>

no more full table scans?

10:12

<Prabhu>

CM quries should go fine now

10:12

<smavocet>

so are we ready to rerun the load test again now?

10:12

<mwenk>

So then do we need to have him cron run the delete stats?

10:12

<Prabhu>

yea...i will send him details

10:13

<smavocet>

Load test?

10:13

<Prabhu>

they may have to adjust auto stats windows too

10:13

<jbush>

corey is re-recording tests, then I have to adjust manually adjust to get the users in there, will probably take 15-20 minutes or so

10:13

<smavocet>

So stats run only at a particular time?

10:13

<smavocet>

Prabhu?

10:14

<smavocet>

ok, wrt jbush

10:14

<smavocet>

Didn't Mark say he had stats running hourly?

10:14

<mwenk>

Once you do that jbush, please commit em to svn

10:16

<kdalex>

I didn't get Mark's latest email. Can someone forward to the list, please?

10:16

<Prabhu>

yea...we can define the windows based on our work load.

10:17

<smavocet>

Done wrt Kirk.

10:18

<kdalex>

Doe we need a phone call between Prabhu and Mark and others to make sure we get this db configuration and stats issue correct? This is absolutely critical!

10:19

<rsmartmike>

thomas, the authorized_keys file was owned by root...i changed that, but i don't think that will make a difference.

10:19

<rsmartmike>

not sure why your key's not workin.

10:19

<tamsler>

did you get it from my last email?

10:19

<rsmartmike>

yes

10:20

<tamsler>

hmm I am using that same key just fine on my systems

10:20

<tamsler>

have you added all the IPs that I have specified in that email?

10:20

<smavocet>

wrt kirk, Mark and Prabhu can call each other anytime. They have each other's cell numbers.

10:20

<tamsler>

or is ssh wide open?

10:21

<kdalex>

no ssh needs to be from the ip you gave them originally

10:21

<rsmartmike>

not sure about the ip's...i will double check

10:21

<tamsler>

I know, I specified my home IP and it is blocked

10:21

<rsmartmike>

dave may not have acted on that one, i will ping him right now

10:21

<Prabhu>

Kirk, I will explain about stats on an e-mail to mark and cc to you all

10:22

<mwenk>

Alright, since it will take a cpl for the load test, gonna use this time to get into PSL

10:22

<kdalex>

Thanks, Prabhu

10:22

<mwenk>

I'll be able to start the load test once I'm in

10:22

<mwenk>

that ok with kirk/sandra?

10:23

<kdalex>

Let's get you in here first...

10:23

<--|

mwenk has left #ucdsakai

10:24

-->|

littlelucca (n=joncarlo@bly.ucdavis.edu) has joined #ucdsakai

10:25

<tamsler>

here are the IPs just in case 169.237.11.246

10:26

<tamsler>

169.237.11.247

10:26

<tamsler>

69.12.224.247

10:32

<tamsler>

Ok I was able to login now

10:32

<tamsler>

but only from the 169.237.11.246 ip and not the other two

10:38

<kdalex>

Prabhu, when can we expect your email?

10:40

<jbush>

ok tests committed, I haven't started test yet to make sure they work

10:41

<kdalex>

ok. guess you can run a short one to test and let us know

10:41

<jbush>

there are 2 news tests, test_8_with_chat.py and test_8_without_chat.py

10:41

<smavocet>

I'll let QA know.

10:42

<smavocet>

Starting when?

10:42

<kdalex>

lets start w/o chat. can we have mark watch...start when we know he is ready

10:42

<jbush>

running short test of first one now, just to make sure it works

10:42

<jbush>

ok

10:42

<smavocet>

Mark's not on irc

10:42

<smavocet>

Not sure he's watching. Is he?

10:43

<kdalex>

yes he is...

10:43

<smavocet>

He's there but not responding here.

10:43

<smavocet>

Responding only in email that I can tell.

10:44

<markpankow>

I am here.

10:44

<markpankow>

I am ready...

10:45

<smavocet>

ok, great. Hadn't seen input from you in some time.

10:45

<smavocet>

I'll let QA know. We're starting now, right?

10:46

<kdalex>

23 sessions now in the db again...showing cm queries. I cannot tell if there are full table scans

10:46

<jbush>

50 user test in process

10:46

<markpankow>

Sorry. I just had 4 other windows open looking at other issues.

10:46

<kdalex>

got this

10:46

<kdalex>

Table Scan: SAKAI_EXTERNAL.COURSEMANAGEMENT_SECTION_TA: 45792 out of 45792 Blocks done

10:47

<smavocet>

What does that mean?

10:49

<smavocet>

so far so good wrt QA and response time as you might expect with such a low load.

10:52

<jbush>

well, its almost done

10:52

<jbush>

looks like assignment is still a problem

10:52

<jbush>

https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11

10:52

<jbush>

that url is taking on average 179 seconds

10:53

<kdalex>

W

10:53

<jbush>

I'll killing the test

10:53

<jbush>

what do what want to do at this point, seems like there is still a problem

10:53

<kdalex>

there are 2 consecutive slashes in that URL...is that ok?

10:55

<jbush>

we need some ideas

10:55

<jbush>

seems like to big things have changed here, new code, and new database

10:55

<jbush>

we could roll back to old tag and test that way

10:56

<kdalex>

We need input from Mark first...

10:56

<jbush>

ok

10:57

<kdalex>

To the UCD programmers....any more on the interrelationship between the URL above and CM?

10:57

<kdalex>

I just got a timeout so I cannot make that URL work for me

10:58

<tamsler>

checking ...

10:58

<markpankow>

837

10:58

<markpankow>

sorry... wrong window

10:58

<kdalex>

Also, what is the meaning of the 8081 port? I just tried it without the port and it came back

10:58

<tamsler>

I can access it just fine ..??

10:59

<jbush>

oh mike exposed 8081-8086 so you can hit tomcat directly

10:59

<jbush>

makes it easier when looking at logs

10:59

<kdalex>

What is the 837?

10:59

<kdalex>

Oh, great! That's good to know

10:59

<markpankow>

nothing... I typed that in the wrong window

10:59

<kdalex>

Well the URL works okay w/o the port...and I can get into Grade mode on the tool which accesses cm

11:00

<tamsler>

what's strange is that every other time I try to access https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11 it doesn't work

11:00

<kdalex>

yeah, it's hanging for me too....do we have to login first to that port to get a valid session on that server?

11:00

<jbush>

yes

11:01

<rsmartmike>

ports 8081-8086 should be accessible via those 30 or so IPs you sent

11:01

<rsmartmike>

if you want, i can paste them in here

11:02

<jbush>

 

11:03

<kdalex>

oh, that's my problem...I am trying from home.

11:04

<kdalex>

Mark - any word on stats or current table scans? Looks like they are still happening though w/o DBA permissions I can't be sure

11:05

<kdalex>

DB load dropped to 0 again

11:05

<markpankow>

I am sending a report from the 11:25 to 11:45 timeframe. Things that looked out of whack before are looking better, but it takes me a bit to read through it.

11:05

<markpankow>

I am sending the email now.

11:07

<kdalex>

URL works from my fixed ip

11:07

<jbush>

yeah seems ok now for me too

11:09

<kdalex>

should we start the load again then and have Mark start both a report and do some realtime checking?

11:09

<kdalex>

Also, could he re-run Prabhu's query now while its quiet to see if we still show full table scan in the explain plan?

11:09

<smavocet>

That url resolved for me once but the second time, it couldn't access it.

11:10

<smavocet>

I'm on my fixed IP address also.

11:10

<jbush>

how many users do we want to do ?

11:11

<kdalex>

worked for me.

11:12

<jbush>

how about 1000 users ?

11:12

<smavocet>

Try it again, using a different browser, Kirk.

11:12

<kdalex>

If you switch browsers you have to relogin to the specific port (e.g. 8081)

11:12

<kdalex>

1000 seems ok. should we limit to 10 mins or so?

11:13

<jbush>

ok, 1000 users for 10 mins, here goes

11:13

<kdalex>

Mark, can you confirm you are monitoring?

11:15

<markpankow>

Okay, I am watching. Give me 10 seconds

11:15

<markpankow>

Olau gp

11:15

<markpankow>

okay go

11:15

<jbush>

about 250 users in so far

11:16

<markpankow>

I have it set to get a before snapshot and another in 10 minutes.

11:16

<jbush>

the assignment url looks fine so far

11:17

<jbush>

about 500 users now

11:17

<kdalex>

about 200 db connections

11:18

<kdalex>

so far so good

11:18

<jbush>

750 users

11:19

<Prabhu>

Whats the DB server load?

11:20

<markpankow>

24

11:20

<kdalex>

trying to load nut 011 001-010 FQ 2009 - still waiting

11:20

<smavocet>

24 is a bit high, not so?

11:21

<Prabhu>

That is too high......is the load increasing with increase in users?

11:21

<jbush>

all 1000 are in now

11:21

<markpankow>

load is still at ~24

11:22

<markpankow>

it does appear to increase with added users.

11:22

<kdalex>

Can anybody else get to a Site Info in a course site?

11:22

<kdalex>

576 database connections

11:22

<jbush>

try pulling up assignment tool in SPA 001 001-0011 site, crazy slow again

11:23

<kdalex>

yes, unfortunately. If Mark's report doesnt' show full table scans then we must have ruined it with the realm refreshes????

11:24

<jbush>

this is concerning

11:24

<jbush>

2009-12-16 11:23:01,076 INFO ajp-10.10.12.20-8009-357 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955

11:24

<jbush>

2009-12-16 11:23:02,193 INFO ajp-10.10.12.20-8009-360 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955

11:24

<jbush>

2009-12-16 11:23:02,554 INFO ajp-10.10.12.20-8009-359 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/319258e7-c2ee-4eab-80c8-65365458c345

11:24

<jbush>

2009-12-16 11:23:03,161 INFO ajp-10.10.12.20-8009-361 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955

11:24

<jbush>

2009-12-16 11:23:05,609 INFO ajp-10.10.12.20-8009-362 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955

11:24

<jbush>

2009-12-16 11:23:06,381 INFO ajp-10.10.12.20-8009-363 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955

11:25

<smavocet>

Kirk, I can get to site info but it takes a long time.

11:26

<smavocet>

So jbush, the authz refresh is too frequent? Thomas?

11:26

<kdalex>

not me. did su to liz applegate and cannot visit her course sites at all

11:28

<smavocet>

spoke too soon. That was section info. Site info hasn't returned yet.

11:28

<jbush>

ah, they logs are cluttered with refreshes

11:29

<jbush>

its a refresh attack

11:29

<kdalex>

Thomas can you look at these logs too, please?

11:29

<jbush>

something is not right with this code, it shouldn't be refreshing like this, its multiple times a second, different threads

11:30

<jbush>

I'm looking at ironmaiden, ucdavis1

11:30

<kdalex>

okay, good find. Thomas and Mike can you look at the code again. Do the logs show where in the code?

11:31

<smavocet>

Mike hasn't made it in yet. On his way.

11:31

<smavocet>

I'll share this script with him though when he gets in.

11:34

<rsmartmike>

well, at least we'll get up to date info in the site roster ;)

11:35

<kdalex>

Yeah, to the nanosecond!

11:35

<kdalex>

Glad you can keep a sense of humor while we're all pestering you in realtime!

11:39

<smavocet>

Are you stopping the load test now or keep it going for continued diagnostics?

11:39

<jbush>

I'm stopping it

11:40

<smavocet>

ok

11:40

<kdalex>

I just got Internal Server Error

11:40

<kdalex>

The server encountered an internal error or misconfiguration and was unable to complete your request.

11:40

<kdalex>

Please contact the server administrator, support@rsmart.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

11:40

<kdalex>

while trying to get to site info on a big site

11:40

<jbush>

this is what I think we should do

11:41

<jbush>

modify the refresh log to output the stack, so we can see where its coming from, make this change on one server, stop all the server and run with only the one server, try a load of 250

11:41

<kdalex>

okay with me. Thomas, Mike: any comments?

11:42

<smavocet>

Mike's on his way into Chiles Road.

11:42

<smavocet>

Sounds good to me also but I'm no substitute for them.

11:43

<tamsler>

the logs show that refreshAuthzGroupIfNecessary is called way to many times

11:44

<kdalex>

can the code be made to check to not do it more than once in a session for a site or something to that effect?

11:44

<jbush>

yes

11:44

<jbush>

use a thread local or something

11:45

<tamsler>

or use a cache indicating when the last refresh was done

11:45

<tamsler>

similar we did before but without the timer

11:45

<jbush>

well that is what its supposed to be doing right now

11:45

<tamsler>

and then use the caches ttl to expire

11:47

-->|

wenk (n=mwenk@psl-242.ucdavis.edu) has joined #ucdsakai

11:47

<smavocet>

Hi Mike. I saved the irc script to update you.

11:48

<wenk>

Ok

 

IRCchat12162009 with rSmart

(missing from 11:48 to 13:05)

 

 

 

 

#ucdsakai

13:05

<wenk>

but then I'd wonder why it works on our 2.5 impl

13:06

<jbush>

synchronized (azGroup) {

13:06

<jbush>

M_log.info("refreshAuthzGroupIfNecessary(): refreshing " + azGroup.getId());

13:06

<jbush>

if (M_log.isDebugEnabled())

Unknown macro: { || 13}

13:06

<jbush>

refreshAuthzGroup(azGroup);

13:06

<jbush>

Long timeoutValue = new Long(new Date().getTime() + (getRefreshTimeoutSeconds() * 1000L));

13:06

<jbush>

realmsRefreshed.put(azGroup.getId(), timeoutValue);

13:06

<jbush>

M_log.debug("refreshAuthzGroupIfNecessary() group: " + azGroup.getId() + ", new timeout=" +timeoutValue);

13:06

<jbush>

}

13:06

<jbush>

I'm not sure all the azGroup objects are the same to all threads

13:08

<jbush>

I'm not sure this is even the problem

13:08

<jbush>

that 7 refresh calls in a 1/10 of sec

13:09

<wenk>

dunno either

13:09

<wenk>

but considering that down in that refresh code it is calling out CM and is wanting sections

13:09

<wenk>

its definitely not helping

13:09

<kdalex>

211 db connections but no long ops

13:10

<kdalex>

now only 5 db connections...clears fast

13:10

<wenk>

Look, all the reading I've done in the past(there was a discussion on it on sakai-dev a while back) about CHM vs the collections api wrapper says we should be using CHM

13:10

<wenk>

not the wrapper

13:11

<jbush>

ok well thats easy to change

13:12

<wenk>

Not sure if it will fix it

13:12

<jbush>

so like this? realmsRefreshed = new ConcurrentHashMap(size);

13:14

<wenk>

yu

13:14

<wenk>

p

13:14

<wenk>

CHM is supposed to scale better

13:14

<wenk>

so lets see if that's true

13:14

<wenk>

anyways, gonna grab some food, bb in 15

13:15

<jbush>

ok we have server back up with the wrapper, lets test that first

13:15

<smavocet>

Response time seem good from ux perspective.

13:16

<jbush>

starting 1000 user test now

13:17

<jbush>

brb in 5

13:19

-->|

corey1 (n=corey@internal.rsmart.com) has joined #ucdsakai

13:19

<tamsler>

why don't we init the cache in the init() method rather then in the refresh method ?

13:20

<jbush>

700 users on...

13:21

<jbush>

1000 users loaded up

13:22

<smavocet>

still looks pretty good for me.

13:25

<smavocet>

Can we let it run a little longer. QA was caught off guard. Thanks.

13:26

<jbush>

yep, response is really good from grinder

13:26

<jbush>

I don't see how this synchronized thing would have fixed it, did mark do something else ?

13:30

<markpankow>

I didn't do anything. I wish I could take credit.

13:30

<tamsler>

The logs still show serveral refresh entries per second

13:31

|<--

kdalex has left freenode (Read error: 110 (Connection timed out))

13:34

-->|

kdalex (n=chatzill@moobilenet-100-51.ucdavis.edu) has joined #ucdsakai

13:34

<wenk>

back

13:37

<kdalex>

me too

13:39

<tamsler>

so does the code use the concurrentHashMap now?

13:40

<jbush>

no, its using the wrapper

13:44

<jbush>

we can try this, change to use ConcurrentHashMap and the synchronize on the authgroup id, like this

13:44

<jbush>

public void refreshAuthzGroupIfNecessary(AuthzGroup azGroup){

13:44

<jbush>

if(azGroup != null && azGroup.getId().startsWith("/site/") && !azGroup.getId().contains("/group/")) {

13:44

<jbush>

if(null==realmsRefreshed){

13:44

|<--

jbush has left freenode (Excess Flood)

13:44

-->|

jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai

13:45

<wenk>

So what exactly changed?

13:46

<jbush>

add this synchronized (azGroup.getId()) {

13:46

<jbush>

and switched to CHM

13:46

<wenk>

is it running with that?

13:46

<wenk>

want to use pastebin ?

13:46

<jbush>

no, its only running with the wrapper

13:46

<jbush>

but there doesn't seem to be a performance problem right now

13:46

<wenk>

I know

13:46

<jbush>

should we try to ramp up to 3000 users first

13:47

<wenk>

which makes me wonder if its not this but something else

13:47

<wenk>

Sure, can do that

13:47

<jbush>

do we want to split it between us ?

13:47

<wenk>

I'll fire off 1800 from here

13:48

<wenk>

Also, lets turn debug off for this run

13:48

<tamsler>

I think we still have an issue

13:48

<tamsler>

looking at the logs, I see several calls to refresh

13:48

<jbush>

if we are going to bounce should we then make the code mods

13:49

<tamsler>

but only the first log message shows up containing refreshing but not the one containing group

13:49

<tamsler>

this is due to the exception being thrown

13:49

<tamsler>

if we take debug out we will tank again

13:49

<jbush>

what exception ?

13:49

<jbush>

I put a stackTrace in on purpose, its not an exception if that is what you are talking about

13:50

<wenk>

yes but there's a ton of overhead

13:50

<jbush>

ah

13:50

<wenk>

If its a timing issue between the threads that will smooth it out a bit

13:51

<kdalex>

are we taking time out for the conference call in 10 min?

13:51

<jbush>

oh I see

13:51

<tamsler>

I think it's still not working as we expect it to

13:52

<tamsler>

if you do: grep refreshAuthzGroupIfNecessary catalina.out.server6 | grep -v site/~ | grep -v 1128

13:52

<tamsler>

you will see that we are getting several refreshed per second

13:53

<jbush>

then I think we should add in the azgroup blocking

13:53

<jbush>

seems like that would address it if it a timing issue

13:55

<wenk>

sync around refreshAuthzGroupIfNecessary

13:55

<jbush>

really, that seems agressive

13:55

<wenk>

wait

13:57

<wenk>

sorry looking at the timer

13:57

<wenk>

something seems odd between the old and new

13:58

<wenk>

yeah, we had that timer thread thing

13:58

<jbush>

we are joining the call now

13:59

<wenk>

will you stay on irc as well?

13:59

<kdalex>

coming

14:00

<kdalex>

room confloct coming

14:03

<kdalex>

thomas and mike...come join the call

14:04

<Prabhu>

I will stay on IRC

14:04

<kdalex>

ok

14:08

|<--

jbush has left freenode (Read error: 60 (Operation timed out))

14:11

-->|

jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai

14:35

<wenk>

Back

14:35

<wenk>

Lemme know when to kick off the ucd side of the load test

14:41

|<--

jbush has left freenode ("Leaving.")

14:43

-->|

jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai

14:46

<rsmartmike>

ok, app servers coming down for new tag deployment. probably be about 15-20 minutes before they can be back up.

14:48

<wenk>

okay

14:48

<wenk>

lemme know when I should start the load test from here

14:49

<rsmartmike>

roger

15:02

<smavocet>

Oh, I wasn't watching

15:02

<smavocet>

We're starting the load test now?

15:04

<--|

corey1 has left #ucdsakai

15:13

<wenk>

mike, how we doin?

15:15

<wenk>

rsmartmike: How we doin ?

15:16

|<--

kzinti34 has left freenode ("ChatZilla 0.9.86 Firefox 3.5.6/20091201220228")

15:17

<rsmartmike>

compiling...prob less than 5 minutes

15:18

<smavocet>

ok

15:19

<wenk>

ok

15:22

<rsmartmike>

starting up...

15:23

<--|

littlelucca has left #ucdsakai

15:27

<wenk>

so kick off the load test?

15:27

<jbush>

yes, started 1200 users over here

15:28

<jbush>

still ramping mine up

15:28

<wenk>

which script should I call? w/o chat?

15:29

<jbush>

that is the one I'm using

15:29

<wenk>

ok

15:32

<wenk>

starting

15:34

<jbush>

my 1200 are going, response time avg is 72 millis

15:36

<wenk>

Oops, didn't put the script in for my last node

15:36

<wenk>

only running 1200

15:36

<wenk>

last 6 will be up in a sec

15:37

<rsmartmike>

db load is < 1

15:37

<jbush>

grinder is getting timeouts now, but response time still good

15:37

<jbush>

dave is checking firewall/apache, could be there

15:37

<rsmartmike>

db is looking great so far, at least...

15:38

<jbush>

nice!

15:38

<rsmartmike>

we are seeing tons of cache hits for authz stuff

15:38

<wenk>

All 1800 up

15:38

<wenk>

Oops, somehow the props I had to ramp up were lost

15:38

<wenk>

so all 1200 and then the 600 connected at once

15:38

<wenk>

Getting errors

15:39

<rsmartmike>

authz cache size is 350-500 on the 3 servers i looked at

15:39

<wenk>

again I wish there was a ghost busting button in the Online tool

15:40

<wenk>

another min b4 we get the new catalina.out

15:40

<rsmartmike>

could we just clean up the session table? woudl that effect the same result?

15:40

<wenk>

cluster table I think

15:40

<wenk>

I'd not worry about it, its just annoyin

15:40

<jbush>

as soon as you go live we'll start working on that :)

15:41

<rsmartmike>

oracle load 1.5

15:41

<wenk>

I'm kicking off timings

15:41

<tamsler>

refreshes look really good

15:41

<tamsler>

lot's of cache hits

15:42

<tamsler>

and no multiple refreshed per realm

15:42

<wenk>

Yeah

15:43

<rsmartmike>

oracle load 3.31, not bad.

15:44

<rsmartmike>

better than 47!

15:44

<kdalex>

Now that's more like it.

15:45

<wenk>

slow performance now

15:45

<wenk>

tho honestly I dunno how much its the site and how much its my client at this point

15:46

<rsmartmike>

oracle load .23

15:46

<rsmartmike>

it is not responsive all of a sudden for me too

15:46

<tamsler>

I am not getting in anymore

15:47

<jbush>

yeah my browser got a timeout

15:47

<tamsler>

what is the server load on the two systems that run apache compared to the other nodes

15:47

<rsmartmike>

.13 & .15

15:47

<rsmartmike>

on 2 apache servers

15:48

<rsmartmike>

.35 now on one of them

15:48

<tamsler>

I wonder if we are running out of tomcat threads

15:48

<rsmartmike>

and it's back

15:48

<rsmartmike>

oracle load to 7.49

15:49

<jbush>

weird its happy again

15:49

<tamsler>

how many threads are allocated per tomcat in server.xml

15:49

<tamsler>

is it the 150 default?

15:50

<rsmartmike>

kswapd1 is active on teh oracle box

15:50

<rsmartmike>

that doesn't seem like a good ting

15:51

<rsmartmike>

apache load: 2.26/0.65

15:51

<tamsler>

I am seeing lots of the following errors:

15:51

<tamsler>

2009-12-16 15:28:22,247 ERROR ajp-10.10.12.45-8009-54 org.sakaiproject.tool.su.SuTool - SuTool Fatal Error: You must be an administrator to become another user. null

15:52

<wenk>

maybe 3000 is too many users

15:52

<wenk>

brb

15:54

<rsmartmike>

1.28gb swap in use on oracle box

15:54

<tamsler>

we are running 250 threads per tomcat instance

15:58

<tamsler>

Total sessions: 12428

16:02

<tamsler>

so how are we doing ?

16:03

<rsmartmike>

not responsive for me at the moment

16:03

<wenk>

same here

16:03

<rsmartmike>

oracle load is very low, .1

16:03

<wenk>

So we killed the app this time

16:03

<wenk>

but not the db

16:03

<rsmartmike>

seeing errors in apache

16:03

<rsmartmike>

could not connect to tomact

16:03

<wenk>

I guess that's a positive

16:03

<wenk>

just came back

16:03

<jbush>

yeah I think thats positive

16:03

<rsmartmike>

so we exceeded tomcat's 500 threads per server seems like

16:04

<wenk>

you'd see that message in the logs

16:04

<rsmartmike>

yeah, weird, i don't think i did

16:05

<wenk>

when i first started using silk performer I did that

16:05

<wenk>

it gave a nice message

16:05

<wenk>

cant remember what it was exactly

16:05

<jbush>

yeah, damn open source apps

16:05

<tamsler>

any out of memory errors

16:06

<wenk>

Sorry this was a bit over 2 years ago ;)

16:06

<wenk>

no, don't see any out of memory

16:06

<wenk>

you guys use jconsole ?

16:07

<rsmartmike>

i've use jvmstat, jconsole is newer

16:07

<rsmartmike>

?

16:07

<wenk>

sigh, I'd love to look at the memory use graph

16:07

<wenk>

plus tomcat gives a bunch of mbeans with fun stuff

16:08

<wenk>

stats on stuff and what not

16:08

|<--

jbush has left freenode ("Leaving.")

16:09

<wenk>

so any of those firewall blocks ?

16:09

<rsmartmike>

are you guys dialing in to conf. line b?

16:09

<rsmartmike>

we are on.

16:09

<rsmartmike>

dave says no.

16:10

<rsmartmike>

fw looks good

16:10

<wenk>

are we supposed to call in a conf ?

16:10

<wenk>

I didn't know I needed to, but sure can

16:11

<kdalex>

okay, on our conf line or yours...sorry we missed the time due to a local crisis

16:11

<rsmartmike>

ours, one sec

16:11

<markpankow>

should I me on this as well?

16:12

<kdalex>

same number as the last call?

16:12

<rsmartmike>

916-233-4242 / 571605

16:12

<rsmartmike>

mark: yes

16:12

<kdalex>

we are coming

16:12

<rsmartmike>

there was some oracle memeory swapping

16:12

<rsmartmike>

but the load was ok

16:24

<Prabhu>

Is the DB server 64 bit? whats the RAM? what % of RAM was used for swap?

16:25

<rsmartmike>

16gb ram, about 1.3 or 1.4 gb showed as swap used

16:25

<rsmartmike>

64-bit centOS 5.4

16:29

<Prabhu>

We got to give 0.75 times the size of RAM for swap on 64bit

16:31

<Prabhu>

Well...that is oracle recommended!

16:34

<kdalex>

so what is the swap size on the database server?

16:36

<Prabhu>

If it is in the right size (12 GB), using 1.3 or 1.4 is absolutely not an issue

16:37

<wenk>

mike, what's the output of free -m ?

16:38

<wenk>

Well 1800 users is still running on the server

16:38

<wenk>

and its quite fast

16:39

<wenk>

is your deployment done ?

16:40

<wenk>

Looks like it to me, see the startup time in server logs

16:46

<wenk>

rsmartmike: You still there?

16:50

-->|

jbush (n=Adium@ip68-3-76-94.ph.ph.cox.net) has joined #ucdsakai

16:50

<wenk>

Hey John

16:53

<wenk>

our 1800 are running

17:01

<jbush>

I've got 250 running and a bunch of timeouts

17:01

<jbush>

oh wait 661

17:01

<jbush>

you seeing any timeouts from grinder ?

17:02

<jbush>

now I'm up to 1000, still ramping up

17:02

<wenk>

errors

17:03

<wenk>

I wonder what we'll see when we just run 1800

17:03

|<--

tamsler has left freenode (Remote closed the connection)

17:03

<wenk>

how's the db server?

17:04

<wenk>

allright folks

17:04

<wenk>

I'm gonna drop, I'll monitor thru night

17:04

<wenk>

I'll leave this up for the moment

17:16

|<--

kdalex has left freenode (Read error: 110 (Connection timed out))

17:46

|<--

Prabhu has left freenode ()

18:21

-->|

markpankow_ (n=chatzill@68.3.111.203) has joined #ucdsakai

18:30

|<--

markpankow has left freenode (Read error: 110 (Connection timed out))

20:38

|<--

jbush has left freenode ("Leaving.")

22:07

-->|

jbush (n=Adium@68.3.76.94) has joined #ucdsakai

22:13

|<--

jbush has left freenode ("Leaving.")

23:30

-->|

jbush (n=Adium@68.3.76.94) has joined #ucdsakai

03:33

|<--

markpankow_ has left freenode ("ChatZilla 0.9.86 Firefox 3.5.5/20091102134505")

06:43

*christel*

Global Notice Hi all, we are (surprise, surprise) still experiencing DDoS, after a quiet period it just started up again. Apologies for the inconvenience.

07:03

|<--

jbush has left freenode ("Leaving.")

 Â