Production Server filling up with Redo Logs resulting in space issue. Deep research!

Situation: We have a production portal that connects to our Oracle database to retrieve data. We had been getting the ORA-00257 archiver error on the portal quite often, which made us look bad. Most of the servers are UNIX based. One of our teammates had been watching the site and letting us know about this error so that we could fix it before a user saw it.

Reason for the error: This error usually shows up when the partition or hard disk that Oracle uses for writing the archived redo logs (referred to simply as redo logs below) is full.

How to check: Connect to the Oracle DB server and check its disk space using df -h on UNIX. The partition showing 100% use is most likely the one Oracle writes the redo logs to. If more than one partition shows 100% usage, find the redo log destination by running show parameter DB_RECOVERY_FILE_DEST in SQL*Plus.
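If you would rather script this check than eyeball df -h, here is a minimal sketch in Python. It is not something we ran at the time; the function names and the 90% threshold are my own assumptions, and the percentage is computed as used/total, so it can differ by a point or two from what df prints.

```python
import shutil


def usage_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return the paths whose filesystem usage is at or above `threshold` percent."""
    return [p for p in paths if usage_percent(p) >= threshold]


if __name__ == "__main__":
    # Hypothetical mount point; replace with the ones df -h lists on your server.
    for mount in ["/"]:
        print(f"{mount}: {usage_percent(mount):.1f}% used")
```

Pass it the mount points from df -h, including the one that DB_RECOVERY_FILE_DEST lives on, and alert on whatever nearly_full returns.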

After figuring out the location where Oracle is writing the redo logs, you can back up or move some of the older ones to a different location to free up space and resolve the issue immediately.
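To see which files are the oldest before moving anything, a small helper like the one below can help. This is only a sketch: the function name is made up for illustration, it looks at a single directory without descending into subfolders, and it only lists files. Moving or deleting log files outside of RMAN can leave RMAN's records out of date, so treat the output as information rather than a to-do list.

```python
import os


def oldest_files(directory, count=10):
    """Return up to `count` regular files in `directory`, oldest modification time first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                mtime = entry.stat(follow_symlinks=False).st_mtime
                entries.append((mtime, entry.path))
    entries.sort()  # sorts by modification time, then by path
    return [path for _, path in entries[:count]]
```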

Not solved yet?: In our case, Oracle kept filling up the disk with redo logs again and again, and we had to clean up each time. We went through various articles on Oracle errors to figure out the root cause. We first suspected the tape drive, because we had had a recent tape drive failure and thought it might be linked.

We looked into various stored scripts on the server which ran on a daily basis and looked at the logs. Looking at the logs for the RMAN scripts on the database server pointed us toward the RMAN-00571 and ORA-19502 errors which were related to space issues too.

These scripts were probably written by a previous DBA. They archive all the redo logs and delete them at the end of the day. They were not completing successfully because there was not enough space on the hard disk.

Solution: Clean up space and make sure the archiving and deleting processes run properly at the end of the day, so that the next round of redo log writing has enough space.

Conclusion: Don’t just look for a direct solution to a problem in the IT field. There might be more than one cause for a single problem. One problem leads to another and then to another.

Setting up the simplenews module for bulk email sending in drupal

In the past, I had a very bad experience coding a PHP script to send bulk e-mail. This time we needed a newsletter sending module for the drupal site to send around 55,000 newsletter emails.

For this I found the simplenews module on the drupal site that served the basic purpose of a newsletter sending application. As per our needs, I also installed the other modules/plugins like simplenews template, simplenews statistics and mime mail which enhanced the features of the simplenews module.

The modules, plugins and other requirements:

The simplenews module: This is the core module needed for the newsletter functionality

New Content type: Create a content type for the simplenews module to use

The simplenews template module: This was needed to add custom header and footer to content type created for simplenews on drupal so that we could send a particular webpage/node (special content type for simplenews) as the newsletter.

The statistics module: This was needed for reporting purposes. It reports how many people viewed the email, clicked the links in it, and so on.

Mime mail: This module helps in sending HTML mails and has to be used together with simplenews template.

The installation and configuration:

First I installed the simplenews module and then the simplenews template and statistics (no specific order needed) then the mime mail module to send out HTML format emails.

I created a content type for the simplenews module and set up the taxonomy and pathauto settings for the content type's terms, so that once a node or page is created its link is generated properly.

The installation of “mime mail css compressor” needed the DOM php plugin installed so I did a yum install on the php plugins on the linux server and it updated the needed plugins for installation of css compressor.

Once all the required modules were installed, I configured the simplenews module. One of the most important settings was to have the mail sending process run from the cron job, and to send as many messages per cron run as possible without letting the cron time out. The maximum number of messages per cron run that I could send successfully was 200; setting it to 500 made the cron run fail with a timeout.

Our server was set up to run cron jobs once every hour. This was very unfavourable for sending 55,000 messages with the simplenews module: at 200 messages per run it would take 275 hours to finish the job. Therefore I did a little research and came across the ELYSIA cron module, which splits up all the cron jobs in drupal and lets us set different cron schedules for different modules.

I changed the drupal maintenance entry in the crontab to * * * * * and then, in the elysia cron module settings, changed the schedule for the simplenews module to run every 5 minutes (*/5 * * * *). Now the simplenews module would take only about 23 hours to finish sending all the messages. (Each cron run was taking 5 minutes on average to complete, hence the 5-minute interval.)
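The arithmetic behind the 275 hours and 23 hours figures can be checked with a few lines of Python. This is just a sketch of the calculation; the function name is mine, and it assumes every cron run sends a full batch and starts on schedule.

```python
import math


def hours_to_send(total_messages, per_run, minutes_between_runs):
    """Return (number of cron runs, total hours) needed to send all messages."""
    runs = math.ceil(total_messages / per_run)
    return runs, runs * minutes_between_runs / 60


if __name__ == "__main__":
    runs, hours = hours_to_send(55000, 200, 60)
    print(f"hourly cron: {runs} runs, {hours:.1f} hours")    # 275 runs, 275.0 hours
    runs, hours = hours_to_send(55000, 200, 5)
    print(f"5-minute cron: {runs} runs, {hours:.1f} hours")  # 275 runs, 22.9 hours
```

The number of runs stays at 275 either way; only the gap between runs changes, which is why shortening the schedule was the setting worth tuning once 200 messages per run proved to be the ceiling.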

Now the mail send process takes its time to send all the messages without crashing the smtp server and without backfiring and sending multiple mails to the same user. This is the solution to sending bulk newsletters without crashing anything.

If you have any questions, please feel free to ask me.

Coldfusion 7 MX cannot understand oracle sometimes

Like every Monday, I came into work today and got a complaint about our portal’s coldfusion business application not working. The coldfusion applications were built on Coldfusion 7 MX and the backend was Oracle 11g. The exceptions on the portal looked as below:

(Screenshot: Portal Exception)
(Screenshot: The Coldfusion Exception)

After 30 minutes of troubleshooting, I went into the coldfusion admin page and noticed that verifying the oracle connection gave an “Internal error: Net8 protocol error” message. After googling a bit, I understood from this forum post that the exception occurred because coldfusion 7 MX was unable to understand the “password about to expire” signals from oracle for the user that coldfusion uses to connect to oracle. So we reset the password for that user in the oracle database using the alter user command and the problem was fixed.

Thanks to Google and the posted solution this production issue was solved within an hour.

Fix to show the hidden content behind the overflow scrollbar in Internet Explorer

One of my web applications at work had a problem displaying its content completely in Internet Explorer.

We had our coldfusion fusebox applications embedded into a PHP template and I had the style set to display only the horizontal overflow scrollbar, i.e. “overflow:auto; overflow-y:hidden”. This works fine in Firefox, but when we open the same page in Internet Explorer there is an issue: the scrollbar hides some of the content behind itself.

The solution to this problem in IE, as mentioned in the referenced link, is to add around 20 pixels of padding at the bottom of the page.

This fixed my problem and works like a charm. Hope this helps others too. Thank you to the guy who wrote the referred link.

sendmail service queue clearing and ORA-24247 error fix using instructions to add ACL info.

The other day at work we had a request from the client to send an e-blast (mass e-mail) to all the email addresses listed in our database (~55k emails).

In the past this was done using a procedure in the oracle database which used one of our smtp mail servers and the oracle UTL_SMTP package. But due to a missing smtp mail server we could not run the procedure. We tried changing the mail server ip from the non-working one to a working one, but this didn’t work. We kept getting an error that said

ORA-24247 network access denied by access control list (ACL)

As this was a time-critical task and we did not have time to figure out where the missing server went, I created a PHP script that retrieves the email addresses from the oracle database and sends the message to each of them one by one using the mail() function in PHP. When tested, this worked great. I estimated the script runtime to be 4 hours. I started the script at 6:30 PM, came back the next morning, and it was still running!

The script had already sent 50 – 300 messages to each recipient by then. To stop the php script I stopped the PHP page which was running all night. This did NOT stop the script in the backend. To stop the mailing process I went through google and figured out that the mail() process used the sendmail service on the redhat server. So I manually stopped the sendmail service which stopped the mails from sending.

I then asked the network administrator to check whether there were any mails in the SMTP server queue. There were thousands of messages in the queue. I asked him to stop all of them, but the messages with RETRY status could not be stopped. There was an unapplied patch that needed to be installed to make this work. At least the mails were no longer going out and were just stuck in the queue.

After 3 days, when I restarted the sendmail service on the server, it resumed sending the messages to all the recipients in the queue. This increased concerns. I then researched a little more and learned that the sendmail service in redhat has its own mail queue, which can be viewed using the mailq command in linux or the sendmail -v -q command. There were 181k messages in the queue waiting to be sent. All the queued messages were stored in the folder /var/spool/mqueue. (Reference)

To delete all the messages in the queue I ran the command rm /var/spool/mqueue/* but this didn’t work and gave me the error “/bin/rm: Argument list too long”. This happens because the shell expands the wildcard into one argument per file, and the combined length of those arguments exceeded the system limit on how much can be passed to a single command. The alternative command (reference) to delete all 181k files is “find . -name '*' | xargs rm”, run from inside the queue folder. It deletes all the files under the current directory regardless of how many there are, because xargs splits them across several rm invocations.
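The same idea, deleting files one at a time instead of handing a huge list to a single command, can be sketched in Python as well. This is not what I ran on the server; the function name is my own, and unlike the find command it only touches regular files directly inside the given folder and leaves subfolders alone.

```python
import os


def clear_directory(directory):
    """Delete every regular file directly inside `directory`; return how many were removed."""
    removed = 0
    with os.scandir(directory) as it:
        for entry in it:
            # Skip subdirectories and symlinks; only plain files are deleted.
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed
```

Because each file is removed with its own call, the number of files never runs into a command-line length limit. As with the rm approach, make sure the sendmail service is stopped first and double check the folder you pass in.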

This way the spooled messages were cleared from the queue, but the problem of the database procedure not being able to access the mail server still existed. So I went through various google articles referring to the ORA-24247 error. I learned that this error was due to an extra security layer in oracle 11g. There is an XML table in the oracle 11g database that tells the packages which server ips they are allowed to access. This article helped me understand the extra security ACL list information and update it to serve my purpose.

This way I fixed the oracle script's access to the smtp server and solved the mailq problem with the e-blast messaging project. I am sharing it to help others fix this problem if they come across it.

CA Arc serv backup agent for oracle

The other day I was helping the network engineers install an Arc Serv Backup agent on one of our servers so that they could manage backups and restores for that server. I installed the Arc Serv agent on the Red Hat Linux server without any problems, but when it came to installing the Oracle backup client, it overwrote all the oracle environment variables and ended up crashing the whole production database.

It initially didn’t cause any problem. The problem became prominent only after a few minutes, when our customers started having trouble. The team and I then worked on troubleshooting it. In the end we figured it out and manually set the ORACLE_HOME variable to fix everything on the oracle production server.

Thank god we were able to figure out the source of the problem in a short time; it took only 1 hour to bring the oracle server back up. I got a chance to explore the Oracle setup while fixing this issue.

“We learn from our mistakes, so we have to make some mistakes” 🙂