Production Server filling up with Redo Logs resulting in space issue. Deep research!

Situation: We have a production portal which connects to our Oracle  database to retrieve some data. We have been getting ORA-00257  archiver error on the portal quite often which makes us look bad. Most of the servers are UNIX based. One of our teammates has been looking at the site and letting us know of this errorso that we fix it before some user looks at it.

Reason for the error: This error usually shows up when the partition or hard disk which is used by Oracle for writing the redo logs is full.

How to check: Connect to the Oracle DB server and check the disk space on it using df -h in case of UNIX. Pretty obviously the partition which has 100% used is the hard disk that Oracle uses to write the redo logs. In case there are more than one partitions that show 100% usage, you need to figure out the redo log writing location from the show parameter DB_RECOVERY_FILE_DEST command in Sqlplus.

After figuring out the destination or location where Oracle is filling up the redo logs you can backup/move some of the older redo logs to a different location to free up space for the issue to be solved immediately.

Not solved yet?: In our case, Oracle kept filling up the redo log files again and again and we had to clean it up again. We went through various articles on oracle errors to figure the root cause of this issue. We first doubted the Tape Drive because we had a recent tape drive failure and this might be linked to it.

We looked into various stored scripts on the server which ran on a daily basis and looked at the logs. Looking at the logs for the RMAN scripts on the database server pointed us toward the RMAN-00571 and ORA-19502 errors which were related to space issues too.

These scripts were written probably by an previous DBA.They archive all the redo logs and deletes them at the end of the day. These were not able to successfully complete due to space not being sufficient on the hard disk.

Solution: Space clean up and making sure the archiving and deleting process are running properly at the end of the day so that the next redo log writing process has enough space to use.

Conclusion: Don’t just look for a direct solution to a problem in the IT field. There might be more than one cause for a single problem. One problem leads to another and then to another.

Coldfusion 7 MX cannot understand oracle sometimes

Like every Monday,  I came into work today and got a complaint about our portal’s coldfusion business application not working.  The coldfusion applications were built on Coldfusion 7 MX and the backend was Oracle 11g. The exception on the portal look as below:

Portal Exception
The Coldfusion Exception

After 30 minutes of troubleshooting, I went into the coldfusion admin page and noticed that when verifying the oracle connection it gave an “Internal error: Net8 protocol error” message. Later on googling a bit, I understood from this forum post that the exception was because coldfusion 7 MX was unable to understand the “password about to expire” signals from oracle for the user that coldfusion uses to connect to oracle. So we reset the password for that user in the oracle database using the alter command and the problem was fixed.

Thanks to Google and the posted solution this production issue was solved within an hour.


sendmail service queue clearing and ORA-24247 error fix using instructions to add ACL info.

The other day at work we had a request from the client to send an e-blast (mass e-mail) to all the email addresses listed in our database (~55k emails).

In the past this was done using a procedure in the oracle database which used one of our smtp mail servers and oracle UTL_SMTP package. But due to a missing smtp mail server we could not run the procedure. We tried changing the mail server ip from a non-working one to a working one, but this didnt work. We kept getting an error that said

ORA-24247 network access denied by access control list (ACL)

As this was a time critical task and we did not have time to figure out where the missing server went, I created a PHP script that retrieves the email addresses from the oracle database and sends the message to all of them one by one using the mail() method in PHP. This when tested worked great. I estimated the script runtime to be 4 hours. I started the script at 630 PM and came back next morning and it was still running!

The script had already sent 50 – 300 messages to each recipient by then. To stop the php script I stopped the PHP page which was running all night. This did NOT stop the script in the backend. To stop the mailing process I went through google and figured out that the mail() process used the sendmail service on the redhat server. So I manually stopped the sendmail service which stopped the mails from sending.

I then asked the network administrator to check if there were any mails in the SMTP server queue. There were 1000s of messages in queue. I requested him to stop all the messages in the queue but the messages with RETRY status could not be stopped. There was a unapplied patch that needed to be run to make this work. At least the mails was not sending the mail but stuck at the queue.

After 3 days when I restarted the sendmail service on the server, it resumed sending the messages to all the recipients in the queue. This increased concerns. I then researched a little bit more and came to know that the sendmail service in redhat has its own mail queue which can be viewed using the mailq command in linux or using the sendmail –v -q command. There were 181k messages in queue waiting to be sent. All the queueud messages were stored in the folder /var/spool/mqueue. (Reference)

To delete all the messages in queue I ran the command rm /var/spool/mqueue but this didn’t work and gave me an error. “/bin/rm: Argument list too long” . This was probably due to the limitations of rm command to have a length of arguments as a max of 1024. The alternate command (reference) to delete all the 181k files is “find . -name ‘*’ | xargs rm”. This deletes all the files in the current directory regardless of the number of files.

This way the spooled messages were cleared from the queue but the problem the database procedure not able to access the mail server still existed. So I went through various google articles referring to the ORA-24247 error. I came to know that this error was due to an extra security layer in oracle 11g. There is an XML table in  the oracle 11g database that tells the packages about particular accessible server ips. This article helped me out in understanding the extra security ACL list information and updating it to serve my purpose.

This way I fixed the oracle script access to the smtp server and solved the mailq problem with the e-blast messaging project. Just shared it to help others fix this problem if the come across it.