Aalaya Seshadri
Feb 25, 2021


Data Science — An art

From my experience with data science projects, I can assure you that AI is not just about knowing algorithms and plugging them in. It is more beautiful than that, and its creative side is often overlooked.

Here are some tips and tricks from my previous projects that you can reuse if they suit your needs.

Image Classification:

Image classification is usually tackled with transfer learning, using pretrained architectures such as VGG, ResNet, and Inception. Sometimes, though, no amount of hyperparameter tuning gets you to your target metric.

I ran into exactly this while working on medical images, where I had a large amount of untagged data.

After trying the usual architectures above, we did something different: we fed all of the tagged and untagged data (excluding the test set) to an autoencoder to reduce the dimensionality of the images, and then applied traditional logistic regression to the encoded features. The results were excellent. The reason is that VGG, ResNet, and the others are pretrained on ImageNet (cats, dogs, etc.), whereas what you need here is a custom image2vec that is specific to medical images.
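A minimal sketch of this idea, assuming scikit-learn is available and using random arrays as stand-ins for flattened images. All names here (`X_unlabeled`, `encode`, the 16-unit bottleneck) are illustrative assumptions, and an `MLPRegressor` trained to reconstruct its own input is only a simple substitute for whatever autoencoder the project actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in data (assumption): 64-pixel flattened "images" with values in [0, 1].
X_unlabeled = rng.random((300, 64))
X_labeled = rng.random((60, 64))
X_test = rng.random((20, 64))

# Stand-in labels (assumption): brighter-than-median images are class 1.
brightness = X_labeled.mean(axis=1)
y_labeled = (brightness > np.median(brightness)).astype(int)

# Train the autoencoder on tagged + untagged data, never on the test set.
# The network learns to reproduce its input through a 16-unit bottleneck.
X_all = np.vstack([X_unlabeled, X_labeled])
autoencoder = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                           max_iter=300, random_state=0)
autoencoder.fit(X_all, X_all)


def encode(X):
    """Return the bottleneck activations, i.e. the reduced representation."""
    hidden = X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0]
    return np.maximum(hidden, 0.0)  # ReLU, matching the hidden layer


# A plain logistic regression on the encoded features of the tagged data only.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(encode(X_labeled), y_labeled)
test_predictions = classifier.predict(encode(X_test))
```

With real images, a convolutional autoencoder would probably be a better fit than this dense one, but the flow stays the same: learn the encoding from every image you have, then fit the classifier on the tagged subset.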

NLU (Natural Language Understanding):

Most data scientists will agree that data preparation and tagging are tedious tasks.

We were working on a six-class text classifier with a hierarchy of classes, and we collected the required data from the medical data sources available on the web. For a portion of the data, the first-level split into two main categories was already done at collection time. We used that portion to train a binary classification model, which we then used to do the first-level split for all the remaining data (this is the AI tagging). After that, human intervention was required to categorize the data further. In short, we used AI and humans for tagging, each where it was needed. If you take this approach, make sure the AI tagging is highly accurate.
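One way to keep the AI tagging accurate is to accept only its confident predictions and send the rest to people. The sketch below assumes a binary model that outputs a probability for class 1. The function name `route_for_tagging` and the 0.95 threshold are hypothetical choices for illustration, not something taken from the project:

```python
def route_for_tagging(texts, probabilities, threshold=0.95):
    """Split texts into auto-tagged pairs and a queue for human review.

    `probabilities` holds the model's probability of class 1 for each text.
    A text is auto-tagged only when the model is confident either way.
    """
    auto_tagged, needs_human = [], []
    for text, p in zip(texts, probabilities):
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            auto_tagged.append((text, int(p >= 0.5)))
        else:
            needs_human.append(text)
    return auto_tagged, needs_human


# Hypothetical model outputs for three documents.
docs = ["doc A", "doc B", "doc C"]
probs = [0.99, 0.60, 0.02]
auto_tagged, needs_human = route_for_tagging(docs, probs)
```

Here "doc A" and "doc C" are tagged automatically, while "doc B" goes to a human because the model is unsure about it.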

OCR:

Have you ever used the OCR services from cloud providers? If you have, you know that data written in adjoining boxes is misread more often than not.

The OCR reads a box boundary as ‘1’; for a boxed value of 098.6, for example, it sometimes returns 10981.6.

What can be done to avoid that? You can mask the contours of the boxes using OpenCV and then pass the image to the OCR service, so that extraction goes smoothly.

Power of Domain Knowledge and Feature Engineering:

Nothing is more important than domain knowledge and common sense.

Feature engineering is the heart of machine learning.

Problem statement: Fraud detection — Shopping app.

This app awards points to customers for their purchases, product scans, and walk-ins at a store, thereby incentivizing their purchasing behaviour. Many fraudsters forge receipts and fake walk-ins and product scans, which hurts the app's revenue.

Walk-in fraud: geospatial analysis. A fraudster's walk-in location usually differs from that of a regular user, although it is still within the permissible limit, and the day and time of the walk-in also differ (usually non-peak hours).

Receipt fraud: focus on product family ID combinations. Fraudsters resubmit the same receipts, changing only the date and appending some products at the end.

Product scan fraud: fraudsters photograph a product barcode from a computer screen (the barcodes are provided by fraud websites) instead of taking the picture in the store. An image classification model that focuses on the background can address this.

Invite fraud: it is highly unlikely that all invitees accept an invite on the same day the inviter sent it, so this feature helps weed out fake invites.
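As one illustration, the receipt pattern above (same items, new date, a few products appended) can be checked by comparing product family ID sequences. This is only a sketch: the function name `looks_like_resubmission` and the sample IDs are made up, and a real check would probably need to tolerate reordered items as well:

```python
def looks_like_resubmission(earlier_ids, later_ids):
    """True when the later receipt repeats the earlier one in order,
    optionally with extra products appended at the end."""
    n = len(earlier_ids)
    if n == 0 or len(later_ids) < n:
        return False
    return list(later_ids[:n]) == list(earlier_ids)


# Hypothetical product family IDs from two receipts of the same user.
first_receipt = [12, 7, 7, 30]
second_receipt = [12, 7, 7, 30, 41]   # same items plus one appended
other_receipt = [5, 12, 7]

padded_copy = looks_like_resubmission(first_receipt, second_receipt)
unrelated = looks_like_resubmission(first_receipt, other_receipt)
```

The boolean result, or a count of such matches per user, can then serve as one more input feature.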

Sometimes even demographic features such as age and gender help detect fraud better.

Once you identify these patterns, all you have to do is come up with numerical representations (distance, number of invites, time between product scans, time taken to accept invites, etc.) and feed them to the ML algorithm.
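A few of these numerical representations can be sketched with the standard library alone. The function names, the sample coordinates, and the timestamps below are all hypothetical, and the haversine formula is just one common way to turn two locations into a distance:

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def same_day_accept_rate(invites):
    """Share of invites accepted on the same calendar day they were sent.

    `invites` is a list of (sent_at, accepted_at) pairs; accepted_at is
    None when the invite was never accepted.
    """
    if not invites:
        return 0.0
    same_day = sum(1 for sent, accepted in invites
                   if accepted is not None and accepted.date() == sent.date())
    return same_day / len(invites)


def mean_scan_gap_seconds(scan_times):
    """Average time between consecutive product scans, or None if < 2 scans."""
    ordered = sorted(scan_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Hypothetical activity for one user.
invites = [
    (datetime(2021, 2, 1, 9, 0), datetime(2021, 2, 1, 9, 5)),
    (datetime(2021, 2, 1, 9, 0), datetime(2021, 2, 1, 9, 7)),
    (datetime(2021, 2, 1, 9, 0), None),
]
scans = [
    datetime(2021, 2, 1, 10, 0, 0),
    datetime(2021, 2, 1, 10, 0, 20),
    datetime(2021, 2, 1, 10, 1, 0),
]
features = {
    "walkin_distance_km": haversine_km(0.0, 0.0, 0.0, 1.0),
    "num_invites": len(invites),
    "same_day_accept_rate": same_day_accept_rate(invites),
    "mean_scan_gap_s": mean_scan_gap_seconds(scans),
}
```

Each user then becomes one row of such features, which is the form most ML algorithms expect.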
