Tuesday, November 4, 2008

Blog Post: When a small bug stops the giant dinosaur - someone will like that...

Yesterday was one of those rare days where I spent 7 hours staring at my screen, trying to get a couple of servers to work. The problem started at 11 AM and was solved by 7 PM, but the trick was in finding it.

 

Briefly, the problem was confined to specific servers in one of my branches. The damn servers didn’t want to become domain controllers no matter what I did, because the SYSVOL folder didn’t want to replicate between the server in HQ and the server in the branch. Here is what was happening:

 

When I launched the domain controller installation wizard, everything went nicely. To complete the installation, a restart is required, and a replication of the SYSVOL folder (where group policies and domain settings reside) has to complete after the restart. This normally takes 15 minutes. Tracing the SYSVOL replication, I found that it stopped at 14 MB (the SYSVOL is about 120 MB). Well, this is normal.
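For anyone who wants to watch the same number I was watching, here is a minimal sketch of how the size of a replicating folder could be measured. It is not the tool I used that day, and the function name `folder_size_mb` is made up for illustration.

```python
import os


def folder_size_mb(root):
    """Return the total size of all files under root, in megabytes (MiB)."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total_bytes += os.path.getsize(path)
            except OSError:
                # The file vanished or is locked mid-replication; skip it.
                continue
    return total_bytes / (1024 * 1024)
```

Calling it every minute or so on the local SYSVOL path and printing the result is enough to see whether the figure keeps growing or gets stuck at the same value.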

 

Restarted the SYSVOL replication: 13.8 MB.

Checked the SYSVOL replication scope using ntfrsutl and it seemed right. Pushed the replication from the command line: 13.6 MB.

Well, this is becoming weird, so let us bring in the army. Tracing the file system activity using Filemon and Process Explorer, I saw that the FRS process could not access the replication delta files after it reached 13 MB. It gave me “Access is denied or file cannot be found”.

 

It is OK, using the BurFlags registry key I can rebuild the whole SYSVOL. I tried that and, silly me, it stopped at 20 MB.

 

Fine, let us skip the first server and bring in the other one. Now guess what: it could not get past about 25.5 MB. First round 25.6, second round 25.4, third round 24.7. Using Filemon and Process Explorer, I found the same output: “Access is denied”.

 

Now it is time to know exactly why access is denied. With advanced logging enabled for the FRS, I found that the transactional log files exist, but the FRS cannot read from or write to them. Monitoring disk-level reads and writes, I found that the disk gets very slow when a big chunk of a log file is replicated to the SYSVOL temp folder. Then I traced the open handles using Handle and... voilà! It is Kaspersky.

 

The avp.exe process was scanning a big log file (about 30 MB in size), and the scan took a long time to complete. The scanning window was longer than the SYSVOL replication timeout window (this is a known bug), so the replication timed out and kept retrying, failing at the same file every time. So a very small registry key stopped my servers from coming online. What a mess!
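To show why the counter kept stalling at roughly the same size, here is a toy model of that retry loop. It is only a sketch, not how FRS or Kaspersky actually work internally: the file sizes, the scan rate, and the timeout are made-up numbers, chosen so the blocking file is 30 MB and the total is 120 MB.

```python
def replicate(file_sizes_mb, scan_mb_per_s, timeout_s, max_retries=3):
    """Toy model: files arrive in order, and each one must be scanned
    before the replication timeout expires, otherwise it is retried.

    Returns (megabytes_replicated, index_of_blocking_file_or_None).
    """
    replicated_mb = 0.0
    for index, size_mb in enumerate(file_sizes_mb):
        scan_time_s = size_mb / scan_mb_per_s
        for _attempt in range(max_retries):
            if scan_time_s <= timeout_s:
                replicated_mb += size_mb
                break
        else:
            # Every retry timed out on the same file, so nothing after it
            # is ever replicated.
            return replicated_mb, index
    return replicated_mb, None


# Hypothetical numbers: the 30 MB entry plays the role of the big log file.
files = [2, 4, 3, 5, 30, 40, 36]

# Scanner active: the 30 MB file needs 60 s, the timeout is 30 s.
print(replicate(files, scan_mb_per_s=0.5, timeout_s=30))  # (14.0, 4)

# Scanner excluded: modelled here as an unlimited scan rate.
print(replicate(files, scan_mb_per_s=float("inf"), timeout_s=30))  # (120.0, None)
```

In this model retrying never helps, because the scan time for the big file is the same on every attempt. The replication always stops just before the same file.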

 

I checked the Kaspersky policy and found that the SYSVOL folder was excluded from the scan, but not its subfolders. I added them, and now the SYSVOL replication runs. A smart one will ask why this didn’t occur in HQ. As far as I can see, it is because the HQ server runs with 4 GB of RAM.
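The difference between excluding a folder and excluding a folder with its subfolders is easy to miss, so here is a small illustration. It is not Kaspersky’s matching logic, only a sketch of the idea. The function `is_excluded` and the paths are hypothetical.

```python
from pathlib import PureWindowsPath


def is_excluded(file_path, excluded_folder, include_subfolders):
    """Return True if file_path is covered by an exclusion on excluded_folder."""
    file_path = PureWindowsPath(file_path)
    excluded_folder = PureWindowsPath(excluded_folder)
    if include_subfolders:
        # Recursive exclusion: the folder may be any ancestor of the file.
        return excluded_folder in file_path.parents
    # Non-recursive exclusion: the file must sit directly in the folder.
    return file_path.parent == excluded_folder


# Hypothetical paths, for illustration only.
sysvol = r"C:\Windows\SYSVOL"
big_log = r"C:\Windows\SYSVOL\staging\big.log"

print(is_excluded(big_log, sysvol, include_subfolders=False))  # False
print(is_excluded(big_log, sysvol, include_subfolders=True))   # True
```

With the non-recursive rule, a file sitting in a subfolder still gets scanned, which is the situation my branch servers were in.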

 

Nice round.

Mahmoud

1 comment:

Dr.Kernel said...

I told you... and you told me KAV is the best and so on :). It’s a good scanning engine, but we have to maintain the CIA triangle, right?