IRC Chat 12162009
|
Irc 12162009 chat with rSmart |
|
#ucdsakai
09:29 |
INFO |
Channel view for “#ucdsakai” opened. |
|
09:29 |
=-= |
User mode for smavocet is now +i |
|
09:29 |
-->| |
YOU (smavocet) have joined #ucdsakai |
|
09:30 |
<kdalex> |
yes, they look okay to me too. |
|
09:31 |
-->| |
markpankow (n=chatzill@72.44.192.164) has joined #ucdsakai |
|
09:31 |
<mwenk> |
If its the problem we had, and IIRC the problem was the cost based optimizer wasn't using the index |
|
09:31 |
<mwenk> |
because the stats on the table were whacked |
|
09:32 |
<kdalex> |
yes, we should have Mark look at that as well as give us the load average on the db machine |
|
09:32 |
-->| |
tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai |
|
09:33 |
|<-- |
tamsler has left freenode (Remote closed the connection) |
|
09:33 |
-->| |
tamsler (n=chatzill@69-12-224-247.dsl.static.sonic.net) has joined #ucdsakai |
|
09:34 |
<tamsler> |
sorry for the delay |
|
09:34 |
<tamsler> |
I just replied to John's email |
|
09:34 |
<tamsler> |
here is the reply: |
|
09:34 |
<tamsler> |
Assignment definitely makes calls to CM. We added the OSIVF to that |
|
09:34 |
<tamsler> |
tool. I think that's one of the tools that you guys have to retrofit |
|
09:35 |
<tamsler> |
with osiv.properties |
|
09:35 |
<mwenk> |
Isn't assignments under OSIV now? |
|
09:35 |
<kdalex> |
how many other places did we do this? |
|
09:35 |
<tamsler> |
checking .. |
|
09:36 |
<tamsler> |
I am looking at https://svn.ucdavis.edu/svn/ucd-sakai/patches/branches/sakai_2-5-x-prod/patch-list.txt |
|
09:36 |
<mwenk> |
web ? |
|
09:36 |
<mwenk> |
it uses section awareness? |
|
09:36 |
<mwenk> |
wow |
|
09:37 |
<tamsler> |
roster, assignment, msgcntr, sections, chat, mailtool, web, gradebook, sections |
|
09:38 |
<tamsler> |
is somebody analyzing the catalina.out w.r.t. osivf errors |
|
09:39 |
<tamsler> |
I still don't have access to their system |
|
09:39 |
<tamsler> |
because 2.6 may have other tools that use cm |
|
09:39 |
<tamsler> |
like OSP etc |
|
09:40 |
<mwenk> |
I can copy those somewhere you can look at em |
|
09:40 |
<mwenk> |
lemme do that now |
|
09:40 |
-->| |
Prabhu (n=prabhuem@psl-87.ucdavis.edu) has joined #ucdsakai |
|
09:41 |
<tamsler> |
brb |
|
09:41 |
<Prabhu> |
Mark, when was the last table switch (ta/tb) run? And was it successful? |
|
09:42 |
<kdalex> |
John Bush...are there any other URLs that are failing? |
|
09:42 |
<kdalex> |
or hanging in the load test? |
|
09:42 |
<jbush> |
that was the worst |
|
09:43 |
<jbush> |
one sec, I'm not sure I saved the report |
|
09:43 |
I just sent the group the explain plans for the queries Prabhu sent. Full table scans on coursemanagement_section. |
||
09:44 |
<smavocet> |
Statistics problem like we had, Prabhu? |
|
09:45 |
<kdalex> |
any idea from anyone why we didn't see this on last week's load test? |
|
09:46 |
<Prabhu> |
i need to know when and how the last table switch script ran |
|
09:46 |
<Prabhu> |
Mark? |
|
09:46 |
<mwenk> |
Thomas: rsmart catalina files copied to sakai@caje:/ucd/opt/sakai/rsmart_logs |
|
09:50 |
-->| |
kzinti34 (n=chatzill@psl-194.ucdavis.edu) has joined #ucdsakai |
|
09:50 |
<tamsler> |
thanx mike |
|
09:52 |
<kdalex> |
Prabhu, what will the time of the table switch tell you...about stats? Can we see if the stats are in place from here? |
|
09:53 |
<tamsler> |
looks like auto.ddl is enabled, not that this is a problem per se, but we never did that |
|
09:53 |
<kdalex> |
number of oracle connections is dropping but performance is still slow |
|
09:53 |
<jbush> |
its really bad |
|
09:54 |
<jbush> |
I'm not running any tests over here |
|
09:55 |
<mwenk> |
Probably the queries from the test haven't returned yet |
|
09:55 |
<mwenk> |
We saw that too |
|
09:55 |
<mwenk> |
sessions would hang until we killed the app |
|
09:55 |
<smavocet> |
Is Mark still on IRC? Prabhu is looking for him. |
|
09:55 |
<mwenk> |
sessions as in DB sessions |
|
09:55 |
thomas, i have copied your pubkey to the server. you will be able to log in now. the other key was messed up and had the incorrect permissions...so that's all fixed. |
||
09:56 |
<tamsler> |
I wonder if the assignment tables are ok in the internal DB |
|
09:56 |
<tamsler> |
lots of errors in catalina.out |
|
09:57 |
sakai_external is the proper schema name... |
||
09:58 |
<Prabhu> |
OK |
|
09:58 |
we use 'sakai' and 'sakai_external' |
||
09:58 |
<Prabhu> |
thanks |
|
09:59 |
I just replied to your email prabhu. Sorry, I was on the phone with Dave. |
||
09:59 |
sakaiext in the pilot. Sakai_external here. |
||
09:59 |
<tamsler> |
I am still not able to ssh tpamsler@smartsite-test.rsmart.com |
|
09:59 |
<mwenk> |
try tamsler@ |
|
10:00 |
<kdalex> |
i tried running the query you sent, Prabhu, from TOAD but I get |
|
10:00 |
<kdalex> |
ORA-02404: specified plan table not found |
|
10:00 |
<jbush> |
do we want to recycle the app servers ? |
|
10:00 |
yes, your userid is 'tamsler' |
||
10:01 |
<Prabhu> |
Kirk, we will not be able to run these queries from our end...we need dba priv on the server itself |
|
10:02 |
oracle server load is still high--do we know why if there are no load tests running? |
||
10:03 |
<Prabhu> |
Mark, I have sent drop stats statements. Please run them and repeat the autotrace |
|
10:03 |
<kdalex> |
I show valid stats on course_management_ta with last date = 12/16/2009 5:02:54 AM |
|
10:04 |
<mwenk> |
jbush, your assignment check relied on a preexisting assignment being there? |
|
10:04 |
<mwenk> |
or did you create an assignment for the test? |
|
10:04 |
<kdalex> |
db sessions have just dropped to 3 from 400 |
|
10:05 |
<smavocet> |
Prabhu sent him delete stats from various tables. |
|
10:05 |
<mwenk> |
did you guys cycle the app servers? |
|
10:05 |
<smavocet> |
sent Mark |
|
10:05 |
oracle server load plummeted |
||
10:05 |
have not cycled tomcat...i can do that now, just give the word. |
||
10:06 |
<kdalex> |
don't think they recycled the servers, I didn't lose my session |
|
10:06 |
<mwenk> |
cool |
|
10:06 |
<mwenk> |
did mark kill the stats then? |
|
10:06 |
<smavocet> |
So once Mark runs the delete stats from several tables, we'll start the load test again? |
|
10:07 |
<jbush> |
much better now |
|
10:07 |
so, cycle tomcat or no? |
||
10:07 |
<Prabhu> |
:) Lets hear from Mark first |
|
10:07 |
<jbush> |
I don't think its necessary |
|
10:07 |
<mwenk> |
I think the assignment NPEs are because whatever assignment it was looking for doesn't exist |
|
10:07 |
<mwenk> |
probably because it was created on the old data but not the new data |
|
10:08 |
<Prabhu> |
we need to look at the trace and explain plans, to make sure there are no full scans |
|
10:08 |
<jbush> |
right |
|
10:08 |
<jbush> |
we should have corey re-record the test |
|
10:08 |
<jbush> |
now that the server are responsive again he can do that |
|
10:08 |
<mwenk> |
I can't believe grinder can't do things like clicking etc |
|
10:08 |
<jbush> |
I was going to have him do test8, once with chat and once without chat |
|
10:08 |
<mwenk> |
reminds me of jmeter |
|
10:09 |
<jbush> |
well thats why we use selenium |
|
10:09 |
<mwenk> |
silk is like selenium |
|
10:09 |
<mwenk> |
in that way anyways |
|
10:09 |
<mwenk> |
well it can do both |
|
10:09 |
<kdalex> |
so did mark do anything to the db to get it to clear? any word on dropping the stats |
|
10:10 |
<tamsler> |
same with tamsler, it just prompts for pw instead of passphrase |
|
10:11 |
<smavocet> |
Mark just sent something in email. |
|
10:11 |
<Prabhu> |
i got explain plans from mark...they look great |
|
10:11 |
<smavocet> |
Explain plan |
|
10:12 |
<mwenk> |
owa is being evil to me |
|
10:12 |
<smavocet> |
so all stats have been removed, Prabhu? |
|
10:12 |
<Prabhu> |
yes |
|
10:12 |
<smavocet> |
no more full table scans? |
|
10:12 |
<Prabhu> |
CM queries should go fine now |
|
10:12 |
<smavocet> |
so are we ready to rerun the load test again now? |
|
10:12 |
<mwenk> |
So then do we need to have him cron run the delete stats? |
|
10:12 |
<Prabhu> |
yea...i will send him details |
|
10:13 |
<smavocet> |
Load test? |
|
10:13 |
<Prabhu> |
they may have to adjust auto stats windows too |
|
10:13 |
<jbush> |
corey is re-recording tests, then I have to manually adjust to get the users in there, will probably take 15-20 minutes or so |
|
10:13 |
<smavocet> |
So stats run only at a particular time? |
|
10:13 |
<smavocet> |
Prabhu? |
|
10:14 |
<smavocet> |
ok, wrt jbush |
|
10:14 |
<smavocet> |
Didn't Mark say he had stats running hourly? |
|
10:14 |
<mwenk> |
Once you do that jbush, please commit em to svn |
|
10:16 |
<kdalex> |
I didn't get Mark's latest email. Can someone forward to the list, please? |
|
10:16 |
<Prabhu> |
yea...we can define the windows based on our work load. |
|
10:17 |
<smavocet> |
Done wrt Kirk. |
|
10:18 |
<kdalex> |
Do we need a phone call between Prabhu and Mark and others to make sure we get this db configuration and stats issue correct? This is absolutely critical! |
|
10:19 |
thomas, the authorized_keys file was owned by root...i changed that, but i don't think that will make a difference. |
||
10:19 |
not sure why your key's not workin. |
||
10:19 |
<tamsler> |
did you get it from my last email? |
|
10:19 |
yes |
||
10:20 |
<tamsler> |
hmm I am using that same key just fine on my systems |
|
10:20 |
<tamsler> |
have you added all the IPs that I have specified in that email? |
|
10:20 |
<smavocet> |
wrt kirk, Mark and Prabhu can call each other anytime. They have each other's cell numbers. |
|
10:20 |
<tamsler> |
or is ssh wide open? |
|
10:21 |
<kdalex> |
no ssh needs to be from the ip you gave them originally |
|
10:21 |
not sure about the ip's...i will double check |
||
10:21 |
<tamsler> |
I know, I specified my home IP and it is blocked |
|
10:21 |
dave may not have acted on that one, i will ping him right now |
||
10:21 |
<Prabhu> |
Kirk, I will explain about stats on an e-mail to mark and cc to you all |
|
10:22 |
<mwenk> |
Alright, since it will take a cpl for the load test, gonna use this time to get into PSL |
|
10:22 |
<kdalex> |
Thanks, Prabhu |
|
10:22 |
<mwenk> |
I'll be able to start the load test once I'm in |
|
10:22 |
<mwenk> |
that ok with kirk/sandra? |
|
10:23 |
<kdalex> |
Let's get you in here first... |
|
10:23 |
<--| |
mwenk has left #ucdsakai |
|
10:24 |
-->| |
littlelucca (n=joncarlo@bly.ucdavis.edu) has joined #ucdsakai |
|
10:25 |
<tamsler> |
here are the IPs just in case 169.237.11.246 |
|
10:26 |
<tamsler> |
169.237.11.247 |
|
10:26 |
<tamsler> |
69.12.224.247 |
|
10:32 |
<tamsler> |
Ok I was able to login now |
|
10:32 |
<tamsler> |
but only from the 169.237.11.246 ip and not the other two |
|
10:38 |
<kdalex> |
Prabhu, when can we expect your email? |
|
10:40 |
<jbush> |
ok tests committed, I haven't started test yet to make sure they work |
|
10:41 |
<kdalex> |
ok. guess you can run a short one to test and let us know |
|
10:41 |
<jbush> |
there are 2 new tests, test_8_with_chat.py and test_8_without_chat.py |
|
10:41 |
<smavocet> |
I'll let QA know. |
|
10:42 |
<smavocet> |
Starting when? |
|
10:42 |
<kdalex> |
lets start w/o chat. can we have mark watch...start when we know he is ready |
|
10:42 |
<jbush> |
running short test of first one now, just to make sure it works |
|
10:42 |
<jbush> |
ok |
|
10:42 |
<smavocet> |
Mark's not on irc |
|
10:42 |
<smavocet> |
Not sure he's watching. Is he? |
|
10:43 |
<kdalex> |
yes he is... |
|
10:43 |
<smavocet> |
He's there but not responding here. |
|
10:43 |
<smavocet> |
Responding only in email that I can tell. |
|
10:44 |
I am here. |
||
10:44 |
I am ready... |
||
10:45 |
<smavocet> |
ok, great. Hadn't seen input from you in some time. |
|
10:45 |
<smavocet> |
I'll let QA know. We're starting now, right? |
|
10:46 |
<kdalex> |
23 sessions now in the db again...showing cm queries. I cannot tell if there are full table scans |
|
10:46 |
<jbush> |
50 user test in process |
|
10:46 |
Sorry. I just had 4 other windows open looking at other issues. |
||
10:46 |
<kdalex> |
got this |
|
10:46 |
<kdalex> |
Table Scan: SAKAI_EXTERNAL.COURSEMANAGEMENT_SECTION_TA: 45792 out of 45792 Blocks done |
|
10:47 |
<smavocet> |
What does that mean? |
|
10:49 |
<smavocet> |
so far so good wrt QA and response time as you might expect with such a low load. |
|
10:52 |
<jbush> |
well, its almost done |
|
10:52 |
<jbush> |
looks like assignment is still a problem |
|
10:52 |
<jbush> |
https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11 |
|
10:52 |
<jbush> |
that url is taking on average 179 seconds |
|
10:53 |
<kdalex> |
W |
|
10:53 |
<jbush> |
I'm killing the test |
|
10:53 |
<jbush> |
what do we want to do at this point, seems like there is still a problem |
|
10:53 |
<kdalex> |
there are 2 consecutive slashes in that URL...is that ok? |
|
10:55 |
<jbush> |
we need some ideas |
|
10:55 |
<jbush> |
seems like two big things have changed here, new code, and new database |
|
10:55 |
<jbush> |
we could roll back to old tag and test that way |
|
10:56 |
<kdalex> |
We need input from Mark first... |
|
10:56 |
<jbush> |
ok |
|
10:57 |
<kdalex> |
To the UCD programmers....any more on the interrelationship between the URL above and CM? |
|
10:57 |
<kdalex> |
I just got a timeout so I cannot make that URL work for me |
|
10:58 |
<tamsler> |
checking ... |
|
10:58 |
837 |
||
10:58 |
sorry... wrong window |
||
10:58 |
<kdalex> |
Also, what is the meaning of the 8081 port? I just tried it without the port and it came back |
|
10:58 |
<tamsler> |
I can access it just fine ..?? |
|
10:59 |
<jbush> |
oh mike exposed 8081-8086 so you can hit tomcat directly |
|
10:59 |
<jbush> |
makes it easier when looking at logs |
|
10:59 |
<kdalex> |
What is the 837? |
|
10:59 |
<kdalex> |
Oh, great! That's good to know |
|
10:59 |
nothing... I typed that in the wrong window |
||
10:59 |
<kdalex> |
Well the URL works okay w/o the port...and I can get into Grade mode on the tool which accesses cm |
|
11:00 |
<tamsler> |
what's strange is that every other time I try to access https://smartsite-test.rsmart.com:8081//xsl-portal/tool/f3275f0a-74ed-467b-8cbd-42354ddbff11 it doesn't work |
|
11:00 |
<kdalex> |
yeah, it's hanging for me too....do we have to login first to that port to get a valid session on that server? |
|
11:00 |
<jbush> |
yes |
|
11:01 |
ports 8081-8086 should be accessible via those 30 or so IPs you sent |
||
11:01 |
if you want, i can paste them in here |
||
11:02 |
<jbush> |
|
|
11:03 |
<kdalex> |
oh, that's my problem...I am trying from home. |
|
11:04 |
<kdalex> |
Mark - any word on stats or current table scans? Looks like they are still happening though w/o DBA permissions I can't be sure |
|
11:05 |
<kdalex> |
DB load dropped to 0 again |
|
11:05 |
I am sending a report from the 11:25 to 11:45 timeframe. Things that looked out of whack before are looking better, but it takes me a bit to read through it. |
||
11:05 |
I am sending the email now. |
||
11:07 |
<kdalex> |
URL works from my fixed ip |
|
11:07 |
<jbush> |
yeah seems ok now for me too |
|
11:09 |
<kdalex> |
should we start the load again then and have Mark start both a report and do some realtime checking? |
|
11:09 |
<kdalex> |
Also, could he re-run Prabhu's query now while its quiet to see if we still show full table scan in the explain plan? |
|
11:09 |
<smavocet> |
That url resolved for me once but the second time, it couldn't access it. |
|
11:10 |
<smavocet> |
I'm on my fixed IP address also. |
|
11:10 |
<jbush> |
how many users do we want to do ? |
|
11:11 |
<kdalex> |
worked for me. |
|
11:12 |
<jbush> |
how about 1000 users ? |
|
11:12 |
<smavocet> |
Try it again, using a different browser, Kirk. |
|
11:12 |
<kdalex> |
If you switch browsers you have to relogin to the specific port (e.g. 8081) |
|
11:12 |
<kdalex> |
1000 seems ok. should we limit to 10 mins or so? |
|
11:13 |
<jbush> |
ok, 1000 users for 10 mins, here goes |
|
11:13 |
<kdalex> |
Mark, can you confirm you are monitoring? |
|
11:15 |
Okay, I am watching. Give me 10 seconds |
||
11:15 |
Olau gp |
||
11:15 |
okay go |
||
11:15 |
<jbush> |
about 250 users in so far |
|
11:16 |
I have it set to get a before snapshot and another in 10 minutes. |
||
11:16 |
<jbush> |
the assignment url looks fine so far |
|
11:17 |
<jbush> |
about 500 users now |
|
11:17 |
<kdalex> |
about 200 db connections |
|
11:18 |
<kdalex> |
so far so good |
|
11:18 |
<jbush> |
750 users |
|
11:19 |
<Prabhu> |
Whats the DB server load? |
|
11:20 |
24 |
||
11:20 |
<kdalex> |
trying to load nut 011 001-010 FQ 2009 - still waiting |
|
11:20 |
<smavocet> |
24 is a bit high, not so? |
|
11:21 |
<Prabhu> |
That is too high......is the load increasing with increase in users? |
|
11:21 |
<jbush> |
all 1000 are in now |
|
11:21 |
load is still at ~24 |
||
11:22 |
it does appear to increase with added users. |
||
11:22 |
<kdalex> |
Can anybody else get to a Site Info in a course site? |
|
11:22 |
<kdalex> |
576 database connections |
|
11:22 |
<jbush> |
try pulling up assignment tool in SPA 001 001-0011 site, crazy slow again |
|
11:23 |
<kdalex> |
yes, unfortunately. If Mark's report doesn't show full table scans then we must have ruined it with the realm refreshes???? |
|
11:24 |
<jbush> |
this is concerning |
|
11:24 |
<jbush> |
2009-12-16 11:23:01,076 INFO ajp-10.10.12.20-8009-357 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955 |
|
11:24 |
<jbush> |
2009-12-16 11:23:02,193 INFO ajp-10.10.12.20-8009-360 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955 |
|
11:24 |
<jbush> |
2009-12-16 11:23:02,554 INFO ajp-10.10.12.20-8009-359 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/319258e7-c2ee-4eab-80c8-65365458c345 |
|
11:24 |
<jbush> |
2009-12-16 11:23:03,161 INFO ajp-10.10.12.20-8009-361 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955 |
|
11:24 |
<jbush> |
2009-12-16 11:23:05,609 INFO ajp-10.10.12.20-8009-362 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955 |
|
11:24 |
<jbush> |
2009-12-16 11:23:06,381 INFO ajp-10.10.12.20-8009-363 org.sakaiproject.authz.impl.BaseAuthzGroupService - refreshAuthzGroupIfNecessary(): refreshing /site/1b3bce07-6659-4aae-ba30-50b94f6d0955 |
|
11:25 |
<smavocet> |
Kirk, I can get to site info but it takes a long time. |
|
11:26 |
<smavocet> |
So jbush, the authz refresh is too frequent? Thomas? |
|
11:26 |
<kdalex> |
not me. did su to liz applegate and cannot visit her course sites at all |
|
11:28 |
<smavocet> |
spoke too soon. That was section info. Site info hasn't returned yet. |
|
11:28 |
<jbush> |
ah, the logs are cluttered with refreshes |
|
11:29 |
<jbush> |
its a refresh attack |
|
11:29 |
<kdalex> |
Thomas can you look at these logs too, please? |
|
11:29 |
<jbush> |
something is not right with this code, it shouldn't be refreshing like this, its multiple times a second, different threads |
|
11:30 |
<jbush> |
I'm looking at ironmaiden, ucdavis1 |
|
11:30 |
<kdalex> |
okay, good find. Thomas and Mike can you look at the code again. Do the logs show where in the code? |
|
11:31 |
<smavocet> |
Mike hasn't made it in yet. On his way. |
|
11:31 |
<smavocet> |
I'll share this script with him though when he gets in. |
|
11:34 |
well, at least we'll get up to date info in the site roster ;) |
||
11:35 |
<kdalex> |
Yeah, to the nanosecond! |
|
11:35 |
<kdalex> |
Glad you can keep a sense of humor while we're all pestering you in realtime! |
|
11:39 |
<smavocet> |
Are you stopping the load test now or keep it going for continued diagnostics? |
|
11:39 |
<jbush> |
I'm stopping it |
|
11:40 |
<smavocet> |
ok |
|
11:40 |
<kdalex> |
I just got Internal Server Error |
|
11:40 |
<kdalex> |
The server encountered an internal error or misconfiguration and was unable to complete your request. |
|
11:40 |
<kdalex> |
Please contact the server administrator, support@rsmart.com and inform them of the time the error occurred, and anything you might have done that may have caused the error. |
|
11:40 |
<kdalex> |
while trying to get to site info on a big site |
|
11:40 |
<jbush> |
this is what I think we should do |
|
11:41 |
<jbush> |
modify the refresh log to output the stack, so we can see where its coming from, make this change on one server, stop all the server and run with only the one server, try a load of 250 |
|
11:41 |
<kdalex> |
okay with me. Thomas, Mike: any comments? |
|
11:42 |
<smavocet> |
Mike's on his way into Chiles Road. |
|
11:42 |
<smavocet> |
Sounds good to me also but I'm no substitute for them. |
|
11:43 |
<tamsler> |
the logs show that refreshAuthzGroupIfNecessary is called way too many times |
|
11:44 |
<kdalex> |
can the code be made to check to not do it more than once in a session for a site or something to that effect? |
|
11:44 |
<jbush> |
yes |
|
11:44 |
<jbush> |
use a thread local or something |
|
11:45 |
<tamsler> |
or use a cache indicating when the last refresh was done |
|
11:45 |
<tamsler> |
similar to what we did before but without the timer |
|
11:45 |
<jbush> |
well that is what its supposed to be doing right now |
|
11:45 |
<tamsler> |
and then use the cache's TTL to expire |
|
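Editorial sketch of the idea discussed above (a cache that records when each realm was last refreshed and expires entries by TTL). This is a minimal illustration under stated assumptions, not the code that was deployed: the class name `RefreshGuard`, the method `shouldRefresh`, and the TTL parameter are hypothetical and do not come from Sakai or from this log.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper: remembers when each realm was last refreshed. */
class RefreshGuard {
    private final ConcurrentMap<String, Long> lastRefresh = new ConcurrentHashMap<String, Long>();
    private final long ttlMillis;

    RefreshGuard(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true for at most one caller per realm per TTL window. */
    boolean shouldRefresh(String realmId, long nowMillis) {
        while (true) {
            Long previous = lastRefresh.get(realmId);
            if (previous == null) {
                // Never refreshed: only the thread that wins putIfAbsent proceeds.
                if (lastRefresh.putIfAbsent(realmId, nowMillis) == null) {
                    return true;
                }
            } else if (nowMillis - previous < ttlMillis) {
                return false; // refreshed recently enough, skip
            } else if (lastRefresh.replace(realmId, previous, nowMillis)) {
                return true; // entry expired and this thread claimed the refresh
            }
            // Lost a race with another thread: loop and re-read the entry.
        }
    }
}
```

A caller would wrap the expensive refresh in `if (guard.shouldRefresh(id, System.currentTimeMillis())) { ... }`, so concurrent request threads for the same site trigger one refresh rather than several.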
11:47 |
-->| |
wenk (n=mwenk@psl-242.ucdavis.edu) has joined #ucdsakai |
|
11:47 |
<smavocet> |
Hi Mike. I saved the irc script to update you. |
|
11:48 |
<wenk> |
Ok |
|
|
IRCchat12162009 with rSmart |
(missing from 11:48 to 13:05) |
|
|
|
|
|
#ucdsakai
13:05 |
<wenk> |
but then I'd wonder why it works on our 2.5 impl |
|
13:06 |
<jbush> |
synchronized (azGroup) { ... } |
|
|
13:06 |
<jbush> |
I'm not sure all the azGroup objects are the same to all threads |
|
13:08 |
<jbush> |
I'm not sure this is even the problem |
|
13:08 |
<jbush> |
that's 7 refresh calls in 1/10 of a sec |
|
13:09 |
<wenk> |
dunno either |
|
13:09 |
<wenk> |
but considering that down in that refresh code it is calling out CM and is wanting sections |
|
13:09 |
<wenk> |
its definitely not helping |
|
13:09 |
<kdalex> |
211 db connections but no long ops |
|
13:10 |
<kdalex> |
now only 5 db connections...clears fast |
|
13:10 |
<wenk> |
Look, all the reading I've done in the past(there was a discussion on it on sakai-dev a while back) about CHM vs the collections api wrapper says we should be using CHM |
|
13:10 |
<wenk> |
not the wrapper |
|
13:11 |
<jbush> |
ok well thats easy to change |
|
13:12 |
<wenk> |
Not sure if it will fix it |
|
13:12 |
<jbush> |
so like this? realmsRefreshed = new ConcurrentHashMap(size); |
|
13:14 |
<wenk> |
yu |
|
13:14 |
<wenk> |
p |
|
13:14 |
<wenk> |
CHM is supposed to scale better |
|
13:14 |
<wenk> |
so lets see if that's true |
|
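Editorial sketch of the difference being discussed between the collections API wrapper and ConcurrentHashMap (CHM). Both make single calls thread-safe; the difference shown here is how a check-then-record step is kept atomic. The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical comparison of the two map choices for tracking refreshed realms. */
class RefreshedRealms {
    // Wrapper: every call takes the same single lock on the whole map.
    private final Map<String, Long> wrapped =
            Collections.synchronizedMap(new HashMap<String, Long>());

    // CHM: designed for concurrent access and offers atomic putIfAbsent.
    private final ConcurrentMap<String, Long> chm = new ConcurrentHashMap<String, Long>();

    /** With the wrapper, check-then-put must be guarded by an explicit lock. */
    boolean claimWithWrapper(String realmId, long now) {
        synchronized (wrapped) {
            if (wrapped.containsKey(realmId)) {
                return false;
            }
            wrapped.put(realmId, now);
            return true;
        }
    }

    /** With CHM, the same claim is one atomic call. */
    boolean claimWithChm(String realmId, long now) {
        return chm.putIfAbsent(realmId, now) == null;
    }
}
```

Without the `synchronized (wrapped)` block, two threads could both pass `containsKey` and both refresh, which would look like the duplicate log entries reported earlier in this chat. Whether that was the actual cause is not established by the log.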
13:14 |
<wenk> |
anyways, gonna grab some food, bb in 15 |
|
13:15 |
<jbush> |
ok we have server back up with the wrapper, lets test that first |
|
13:15 |
<smavocet> |
Response times seem good from ux perspective. |
|
13:16 |
<jbush> |
starting 1000 user test now |
|
13:17 |
<jbush> |
brb in 5 |
|
13:19 |
-->| |
corey1 (n=corey@internal.rsmart.com) has joined #ucdsakai |
|
13:19 |
<tamsler> |
why don't we init the cache in the init() method rather than in the refresh method ? |
|
13:20 |
<jbush> |
700 users on... |
|
13:21 |
<jbush> |
1000 users loaded up |
|
13:22 |
<smavocet> |
still looks pretty good for me. |
|
13:25 |
<smavocet> |
Can we let it run a little longer. QA was caught off guard. Thanks. |
|
13:26 |
<jbush> |
yep, response is really good from grinder |
|
13:26 |
<jbush> |
I don't see how this synchronized thing would have fixed it, did mark do something else ? |
|
13:30 |
I didn't do anything. I wish I could take credit. |
||
13:30 |
<tamsler> |
The logs still show several refresh entries per second |
|
13:31 |
|<-- |
kdalex has left freenode (Read error: 110 (Connection timed out)) |
|
13:34 |
-->| |
kdalex (n=chatzill@moobilenet-100-51.ucdavis.edu) has joined #ucdsakai |
|
13:34 |
<wenk> |
back |
|
13:37 |
<kdalex> |
me too |
|
13:39 |
<tamsler> |
so does the code use the concurrentHashMap now? |
|
13:40 |
<jbush> |
no, its using the wrapper |
|
13:44 |
<jbush> |
we can try this, change to use ConcurrentHashMap and then synchronize on the authgroup id, like this |
|
13:44 |
<jbush> |
public void refreshAuthzGroupIfNecessary(AuthzGroup azGroup){ |
|
13:44 |
<jbush> |
if(azGroup != null && azGroup.getId().startsWith("/site/") && !azGroup.getId().contains("/group/")) { |
|
13:44 |
<jbush> |
if(null==realmsRefreshed){ |
|
13:44 |
|<-- |
jbush has left freenode (Excess Flood) |
|
13:44 |
-->| |
jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai |
|
13:45 |
<wenk> |
So what exactly changed? |
|
13:46 |
<jbush> |
add this synchronized (azGroup.getId()) { |
|
13:46 |
<jbush> |
and switched to CHM |
|
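Editorial note with a sketch: `synchronized (azGroup.getId())` locks on a String object, and two threads only exclude each other if they hold the very same String instance. Two equal ids built separately are different lock objects, which echoes the earlier concern that the azGroup objects may not be the same across threads. One possible alternative, assuming a registry class that does not exist in the original code (`RealmLocks` and `lockFor` are hypothetical names), is to hand out one lock object per id value:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical registry that maps each realm id value to a single lock object. */
class RealmLocks {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<String, Object>();

    /** Returns the same lock for equal ids, even if the String instances differ. */
    Object lockFor(String realmId) {
        Object fresh = new Object();
        Object existing = locks.putIfAbsent(realmId, fresh);
        return existing != null ? existing : fresh;
    }
}
```

Usage would be `synchronized (locks.lockFor(azGroup.getId())) { ... }`. The map in this sketch grows by one entry per realm id and is never pruned, so a real implementation would need some eviction policy.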
13:46 |
<wenk> |
is it running with that? |
|
13:46 |
<wenk> |
want to use pastebin ? |
|
13:46 |
<jbush> |
no, its only running with the wrapper |
|
13:46 |
<jbush> |
but there doesn't seem to be a performance problem right now |
|
13:46 |
<wenk> |
I know |
|
13:46 |
<jbush> |
should we try to ramp up to 3000 users first |
|
13:47 |
<wenk> |
which makes me wonder if its not this but something else |
|
13:47 |
<wenk> |
Sure, can do that |
|
13:47 |
<jbush> |
do we want to split it between us ? |
|
13:47 |
<wenk> |
I'll fire off 1800 from here |
|
13:48 |
<wenk> |
Also, lets turn debug off for this run |
|
13:48 |
<tamsler> |
I think we still have an issue |
|
13:48 |
<tamsler> |
looking at the logs, I see several calls to refresh |
|
13:48 |
<jbush> |
if we are going to bounce should we then make the code mods |
|
13:49 |
<tamsler> |
but only the first log message shows up containing refreshing but not the one containing group |
|
13:49 |
<tamsler> |
this is due to the exception being thrown |
|
13:49 |
<tamsler> |
if we take debug out we will tank again |
|
13:49 |
<jbush> |
what exception ? |
|
13:49 |
<jbush> |
I put a stackTrace in on purpose, its not an exception if that is what you are talking about |
|
13:50 |
<wenk> |
yes but there's a ton of overhead |
|
13:50 |
<jbush> |
ah |
|
13:50 |
<wenk> |
If its a timing issue between the threads that will smooth it out a bit |
|
13:51 |
<kdalex> |
are we taking time out for the conference call in 10 min? |
|
13:51 |
<jbush> |
oh I see |
|
13:51 |
<tamsler> |
I think it's still not working as we expect it to |
|
13:52 |
<tamsler> |
if you do: grep refreshAuthzGroupIfNecessary catalina.out.server6 | grep -v site/~ | grep -v 1128 |
|
13:52 |
<tamsler> |
you will see that we are getting several refreshes per second |
|
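Editorial sketch: the grep above lists the refresh lines; to quantify "several per second", the same lines can be counted per second and per realm. The pattern below assumes the log line format pasted earlier in this chat (timestamp, level, thread, class, then `refreshAuthzGroupIfNecessary(): refreshing <realm>`). The class name `RefreshCounter` is hypothetical, and the sketch does not reproduce the `grep -v` exclusions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log helper: counts refresh lines per (second, realm) pair. */
class RefreshCounter {
    // Group 1: timestamp truncated to the second. Group 2: realm id after "refreshing ".
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d+ "
            + ".*refreshAuthzGroupIfNecessary\\(\\): refreshing (\\S+)");

    static Map<String, Integer> countPerSecond(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                continue; // not a refresh line
            }
            String key = m.group(1) + " " + m.group(2);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }
}
```

Any key with a count above 1 means the same realm was refreshed more than once within a single second.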
13:53 |
<jbush> |
then I think we should add in the azgroup blocking |
|
13:53 |
<jbush> |
seems like that would address it if it a timing issue |
|
13:55 |
<wenk> |
sync around refreshAuthzGroupIfNecessary |
|
13:55 |
<jbush> |
really, that seems aggressive |
|
13:55 |
<wenk> |
wait |
|
13:57 |
<wenk> |
sorry looking at the timer |
|
13:57 |
<wenk> |
something seems odd between the old and new |
|
13:58 |
<wenk> |
yeah, we had that timer thread thing |
|
13:58 |
<jbush> |
we are joining the call now |
|
13:59 |
<wenk> |
will you stay on irc as well? |
|
13:59 |
<kdalex> |
coming |
|
14:00 |
<kdalex> |
room conflict coming |
|
14:03 |
<kdalex> |
thomas and mike...come join the call |
|
14:04 |
<Prabhu> |
I will stay on IRC |
|
14:04 |
<kdalex> |
ok |
|
14:08 |
|<-- |
jbush has left freenode (Read error: 60 (Operation timed out)) |
|
14:11 |
-->| |
jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai |
|
14:35 |
<wenk> |
Back |
|
14:35 |
<wenk> |
Lemme know when to kick off the ucd side of the load test |
|
14:41 |
|<-- |
jbush has left freenode ("Leaving.") |
|
14:43 |
-->| |
jbush (n=Adium@internal.rsmart.com) has joined #ucdsakai |
|
14:46 |
ok, app servers coming down for new tag deployment. probably be about 15-20 minutes before they can be back up. |
||
14:48 |
<wenk> |
okay |
|
14:48 |
<wenk> |
lemme know when I should start the load test from here |
|
14:49 |
roger |
||
15:02 |
<smavocet> |
Oh, I wasn't watching |
|
15:02 |
<smavocet> |
We're starting the load test now? |
|
15:04 |
<--| |
corey1 has left #ucdsakai |
|
15:13 |
<wenk> |
mike, how we doin? |
|
15:15 |
<wenk> |
rsmartmike: How we doin ? |
|
15:16 |
|<-- |
kzinti34 has left freenode ("ChatZilla 0.9.86 Firefox 3.5.6/20091201220228") |
|
15:17 |
compiling...prob less than 5 minutes |
||
15:18 |
<smavocet> |
ok |
|
15:19 |
<wenk> |
ok |
|
15:22 |
starting up... |
||
15:23 |
<--| |
littlelucca has left #ucdsakai |
|
15:27 |
<wenk> |
so kick off the load test? |
|
15:27 |
<jbush> |
yes, started 1200 users over here |
|
15:28 |
<jbush> |
still ramping mine up |
|
15:28 |
<wenk> |
which script should I call? w/o chat? |
|
15:29 |
<jbush> |
that is the one I'm using |
|
15:29 |
<wenk> |
ok |
|
15:32 |
<wenk> |
starting |
|
15:34 |
<jbush> |
my 1200 are going, response time avg is 72 millis |
|
15:36 |
<wenk> |
Oops, didn't put the script in for my last node |
|
15:36 |
<wenk> |
only running 1200 |
|
15:36 |
<wenk> |
last 6 will be up in a sec |
|
15:37 |
db load is < 1 |
||
15:37 |
<jbush> |
grinder is getting timeouts now, but response time still good |
|
15:37 |
<jbush> |
dave is checking firewall/apache, could be there |
|
15:37 |
db is looking great so far, at least... |
||
15:38 |
<jbush> |
nice! |
|
15:38 |
we are seeing tons of cache hits for authz stuff |
||
15:38 |
<wenk> |
All 1800 up |
|
15:38 |
<wenk> |
Oops, somehow the props I had to ramp up were lost |
|
15:38 |
<wenk> |
so all 1200 and then the 600 connected at once |
|
15:38 |
<wenk> |
Getting errors |
|
15:39 |
authz cache size is 350-500 on the 3 servers i looked at |
||
15:39 |
<wenk> |
again I wish there was a ghost busting button in the Online tool |
|
15:40 |
<wenk> |
another min b4 we get the new catalina.out |
|
15:40 |
could we just clean up the session table? would that effect the same result? |
||
15:40 |
<wenk> |
cluster table I think |
|
15:40 |
<wenk> |
I'd not worry about it, its just annoyin |
|
15:40 |
<jbush> |
as soon as you go live we'll start working on that :) |
|
15:41 |
oracle load 1.5 |
||
15:41 |
<wenk> |
I'm kicking off timings |
|
15:41 |
<tamsler> |
refreshes look really good |
|
15:41 |
<tamsler> |
lot's of cache hits |
|
15:42 |
<tamsler> |
and no multiple refreshes per realm |
|
15:42 |
<wenk> |
Yeah |
|
15:43 |
oracle load 3.31, not bad. |
||
15:44 |
better than 47! |
||
15:44 |
<kdalex> |
Now that's more like it. |
|
15:45 |
<wenk> |
slow performance now |
|
15:45 |
<wenk> |
tho honestly I dunno how much its the site and how much its my client at this point |
|
15:46 |
oracle load .23 |
||
15:46 |
it is not responsive all of a sudden for me too |
||
15:46 |
<tamsler> |
I am not getting in anymore |
|
15:47 |
<jbush> |
yeah my browser got a timeout |
|
15:47 |
<tamsler> |
what is the server load on the two systems that run apache compared to the other nodes |
|
15:47 |
.13 & .15 |
||
15:47 |
on 2 apache servers |
||
15:48 |
.35 now on one of them |
||
15:48 |
<tamsler> |
I wonder if we are running out of tomcat threads |
|
15:48 |
and it's back |
||
15:48 |
oracle load to 7.49 |
||
15:49 |
<jbush> |
weird its happy again |
|
15:49 |
<tamsler> |
how many threads are allocated per tomcat in server.xml |
|
15:49 |
<tamsler> |
is it the 150 default? |
|
15:50 |
kswapd1 is active on the oracle box |
||
15:50 |
that doesn't seem like a good thing |
||
15:51 |
apache load: 2.26/0.65 |
||
15:51 |
<tamsler> |
I am seeing lots of the following errors: |
|
15:51 |
<tamsler> |
2009-12-16 15:28:22,247 ERROR ajp-10.10.12.45-8009-54 org.sakaiproject.tool.su.SuTool - SuTool Fatal Error: You must be an administrator to become another user. null |
|
15:52 |
<wenk> |
maybe 3000 is too many users |
|
15:52 |
<wenk> |
brb |
|
15:54 |
1.28gb swap in use on oracle box |
||
15:54 |
<tamsler> |
we are running 250 threads per tomcat instance |
|
15:58 |
<tamsler> |
Total sessions: 12428 |
|
16:02 |
<tamsler> |
so how are we doing ? |
|
16:03 |
not responsive for me at the moment |
||
16:03 |
<wenk> |
same here |
|
16:03 |
oracle load is very low, .1 |
||
16:03 |
<wenk> |
So we killed the app this time |
|
16:03 |
<wenk> |
but not the db |
|
16:03 |
seeing errors in apache |
||
16:03 |
could not connect to tomcat |
||
16:03 |
<wenk> |
I guess that's a positive |
|
16:03 |
<wenk> |
just came back |
|
16:03 |
<jbush> |
yeah I think that's positive |
|
16:03 |
so we exceeded tomcat's 500 threads per server seems like |
||
16:04 |
<wenk> |
you'd see that message in the logs |
|
16:04 |
yeah, weird, i don't think i did |
||
16:05 |
<wenk> |
when i first started using silk performer I did that |
|
16:05 |
<wenk> |
it gave a nice message |
|
16:05 |
<wenk> |
can't remember what it was exactly |
|
16:05 |
<jbush> |
yeah, damn open source apps |
|
16:05 |
<tamsler> |
any out of memory errors |
|
16:06 |
<wenk> |
Sorry this was a bit over 2 years ago ;) |
|
16:06 |
<wenk> |
no, don't see any out of memory |
|
16:06 |
<wenk> |
you guys use jconsole ? |
|
16:07 |
i've used jvmstat, jconsole is newer |
||
16:07 |
? |
||
16:07 |
<wenk> |
sigh, I'd love to look at the memory use graph |
|
16:07 |
<wenk> |
plus tomcat gives a bunch of mbeans with fun stuff |
|
16:08 |
<wenk> |
stats on stuff and what not |
|
16:08 |
|<-- |
jbush has left freenode ("Leaving.") |
|
16:09 |
<wenk> |
so any of those firewall blocks ? |
|
16:09 |
are you guys dialing in to conf. line b? |
||
16:09 |
we are on. |
||
16:09 |
dave says no. |
||
16:10 |
fw looks good |
||
16:10 |
<wenk> |
are we supposed to call in a conf ? |
|
16:10 |
<wenk> |
I didn't know I needed to, but sure can |
|
16:11 |
<kdalex> |
okay, on our conf line or yours...sorry we missed the time due to a local crisis |
|
16:11 |
ours, one sec |
||
16:11 |
should I be on this as well? |
||
16:12 |
<kdalex> |
same number as the last call? |
|
16:12 |
916-233-4242 / 571605 |
||
16:12 |
mark: yes |
||
16:12 |
<kdalex> |
we are coming |
|
16:12 |
there was some oracle memory swapping |
||
16:12 |
but the load was ok |
||
16:24 |
<Prabhu> |
Is the DB server 64 bit? What's the RAM? What % of RAM was used for swap? |
|
16:25 |
16gb ram, about 1.3 or 1.4 gb showed as swap used |
||
16:25 |
64-bit CentOS 5.4 |
||
16:29 |
<Prabhu> |
We've got to give 0.75 times the size of RAM for swap on 64-bit |
|
16:31 |
<Prabhu> |
Well...that is oracle recommended! |
|
16:34 |
<kdalex> |
so what is the swap size on the database server? |
|
16:36 |
<Prabhu> |
If it is the right size (12 GB), using 1.3 or 1.4 GB is absolutely not an issue |
|
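The arithmetic behind the 12 GB figure can be sketched as follows. This only restates the numbers given in the chat (16 GB RAM, a 0.75 factor, 1.3 to 1.4 GB of swap in use); the factor is taken from the discussion as-is and is not checked here against Oracle's documentation, and both function names are hypothetical.

```python
def recommended_swap_gb(ram_gb, factor=0.75):
    """Swap size suggested by the rule of thumb quoted in the chat.

    The 0.75 factor is an assumption carried over from the discussion.
    """
    return ram_gb * factor


def swap_used_fraction(used_gb, swap_gb):
    """Fraction of the configured swap that is currently in use."""
    if swap_gb <= 0:
        raise ValueError("swap_gb must be positive")
    return used_gb / swap_gb


ram_gb = 16  # RAM on the database server, as reported in the chat
swap_gb = recommended_swap_gb(ram_gb)
for used_gb in (1.3, 1.4):
    fraction = swap_used_fraction(used_gb, swap_gb)
    print(f"{used_gb} GB of {swap_gb} GB swap = {fraction:.1%}")
```

Under these assumptions the swap in use is roughly a tenth of the recommended size, which matches the view that it is not a concern in itself. Whether the server actually has 12 GB of swap configured is a separate question that `free -m` would answer.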
16:37 |
<wenk> |
mike, what's the output of free -m ? |
|
16:38 |
<wenk> |
Well, 1800 users are still running on the server |
|
16:38 |
<wenk> |
and it's quite fast |
|
16:39 |
<wenk> |
is your deployment done ? |
|
16:40 |
<wenk> |
Looks like it to me, see the startup time in server logs |
|
16:46 |
<wenk> |
rsmartmike: You still there? |
|
16:50 |
-->| |
jbush (n=Adium@ip68-3-76-94.ph.ph.cox.net) has joined #ucdsakai |
|
16:50 |
<wenk> |
Hey John |
|
16:53 |
<wenk> |
our 1800 are running |
|
17:01 |
<jbush> |
I've got 250 running and a bunch of timeouts |
|
17:01 |
<jbush> |
oh wait 661 |
|
17:01 |
<jbush> |
you seeing any timeouts from grinder ? |
|
17:02 |
<jbush> |
now I'm up to 1000, still ramping up |
|
17:02 |
<wenk> |
errors |
|
17:03 |
<wenk> |
I wonder what we'll see when we just run 1800 |
|
17:03 |
|<-- |
tamsler has left freenode (Remote closed the connection) |
|
17:03 |
<wenk> |
how's the db server? |
|
17:04 |
<wenk> |
all right folks |
|
17:04 |
<wenk> |
I'm gonna drop, I'll monitor thru night |
|
17:04 |
<wenk> |
I'll leave this up for the moment |
|
17:16 |
|<-- |
kdalex has left freenode (Read error: 110 (Connection timed out)) |
|
17:46 |
|<-- |
Prabhu has left freenode () |
|
18:21 |
-->| |
markpankow_ (n=chatzill@68.3.111.203) has joined #ucdsakai |
|
18:30 |
|<-- |
markpankow has left freenode (Read error: 110 (Connection timed out)) |
|
20:38 |
|<-- |
jbush has left freenode ("Leaving.") |
|
22:07 |
-->| |
jbush (n=Adium@68.3.76.94) has joined #ucdsakai |
|
22:13 |
|<-- |
jbush has left freenode ("Leaving.") |
|
23:30 |
-->| |
jbush (n=Adium@68.3.76.94) has joined #ucdsakai |
|
03:33 |
|<-- |
markpankow_ has left freenode ("ChatZilla 0.9.86 Firefox 3.5.5/20091102134505") |
|
06:43 |
*christel* |
Global Notice Hi all, we are (surprise, surprise) still experiencing DDoS, after a quiet period it just started up again. Apologies for the inconvenience. |
|
07:03 |
|<-- |
jbush has left freenode ("Leaving.") |
  |