How to detect a G Suite hosted email address

by Martin


Posted on October 12, 2020


Email profiling and detection of the email hosting provider are useful for understanding customers. When an email address is hosted on G Suite, it usually means either a premium paid service or hosting for an academic institution.

There is a chance that an email address has been hosted at Google since the times when “Google Apps for your domain” was offered for free. In that case it is not a premium paid service; however, this cannot be distinguished using publicly available information.

Every domain that is able to receive emails has MX records among its DNS records. For a domain hosted on G Suite, those records point to Google email servers. If the email address is john@example.com and example.com uses G Suite, an MX record could look something like this:

{
      "host": "example.com",
      "class": "IN",
      "ttl": 1800,
      "type": "MX",
      "pri": 1,
      "target": "aspmx.l.google.com"
}

Here we can see the target server address aspmx.l.google.com. This server physically receives all emails sent to ***@example.com addresses.
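If you want to see these records yourself, the lookup is easy to script. A minimal sketch using the third-party dnspython package (my choice for illustration; the decision flow below uses a hosted API instead):

import dns.resolver  # third-party package: dnspython

# Print a domain's MX records ordered by priority (lowest first).
for record in sorted(dns.resolver.resolve("example.com", "MX"),
                     key=lambda r: r.preference):
    print(record.preference, record.exchange)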

To access MX record data from within a decision flow, we need a service that can query DNS records. For this purpose, we can use our sample API:

http://sampleapis.pricepit.net/dns/?email=john@example.com 
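The same data can be fetched programmatically. A sketch using Python's requests library; note that the exact response shape (an mx array of record objects) is my assumption based on the workflow output shown later in this post:

import requests

# Call the sample DNS API; the "mx" key in the response is assumed
# from the workflow output described in step 5.
response = requests.get("http://sampleapis.pricepit.net/dns/",
                        params={"email": "john@example.com"},
                        timeout=10)
for record in response.json().get("mx", []):
    print(record["target"])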

Now, let us put it all together in the Decisimo decision engine.

1. Create a sample data object

We want to create a sample data object so that we can test our rules quickly and simply. In it, we will add the attribute $.email with the dummy value journalist@wsj.com (wsj.com runs on G Suite).

2. Create a data source

We can create the data source by filling in the configuration manually, or simply create it from the available template.

[Image: New data source from template]
[Image: Selecting the template]

3. Create a workflow

In the new workflow, we add three steps – start, external data, and rule set – and connect them in this sequence.

[Image: Workflow with all the steps]

Start indicates to the engine where it shall begin executing. External data is a step that calls, in parallel, the data sources defined in the step's options. Rule set is a holder for a selected rule set.

4. Define the external data step

Next, we add an option for external data – the DNS check data source we have just created. We know the check expects the parameter email to contain an email address, so we define that this parameter will be filled with the value from the data object attribute $.email.

[Image: Options of workflow step for external data]
[Image: Defining an external data call parameter – option of the step]

5. Intermediate release & test

Now we can release and test the workflow. A release creates a snapshot of the current workflow and all of its parts, which lets us test how the current flow behaves during execution.

[Image: Message showing that the release was completed]

On the release page, we can select our testing data object from the drop-down list on the left and click Load; a JSON object with our email attribute shall appear below.

Once the data object is loaded, we can hit the Test release button, wait a moment, and see the result of the mock run. When the test is finished, we see on the right the initial JSON enriched with the data collected by the engine along the way.

In the result, we can see what external data was loaded by the DNS check. Under the path output – check – dns-check – mx, there is an array of MX records found for the email address's domain.

[Image: Results of a mock run of the sample JSON against the release]

If there are no records in mx, the domain does not receive emails. Sometimes there is only one record, sometimes there are ten.

We will need to write a JSONPath expression to get those MX details for our rule. I recommend copying the JSON response/result. If you are not proficient in writing JSONPath, you can use websites like jsonpath.com.

6. Create a rule

Now that we know what data to expect, we create a rule set to determine whether an email address is Google hosted or not.

In our rule, we will evaluate the content of $.output.check.dns-check.mx[0].target. Since Google-hosted domains usually have several records, we will not match exactly; we will consider the domain Google hosted if the target ends with aspmx.l.google.com.

[Image: Rule for detecting G Suite – evaluating the "ends with (case insensitive)" operation]
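Expressed in code, the rule amounts to something like the following sketch, where result stands for the enriched JSON from the test run in step 5:

def is_google_hosted(result: dict) -> bool:
    # Pull the first MX target out of the enriched result.
    try:
        target = result["output"]["check"]["dns-check"]["mx"][0]["target"]
    except (KeyError, IndexError):
        return False  # no MX records at all - cannot be Google hosted
    # The "ends with (case insensitive)" operation from the rule.
    return target.lower().endswith("aspmx.l.google.com")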

We will create two rules – one detecting that the address is Google hosted and one for the negation – not Google G Suite.

Once we have that, we shall update the workflow step for the rule set: set our rule set in the step, save, and test & release.

7. Testing, endpoint creation, and deployment

Now we can see whether the detection works properly. wsj.com should be detected as “is google”; anything else should not be.

If our testing works, we can set up an endpoint. You should create your endpoint on an instance physically as close to your servers as possible: if you are in Europe, use a European one; if in Asia, an Asian one. A different region can add several hundred milliseconds to the response time.

[Image: API endpoint for deploying decision rules]

Once you have the endpoint ready, go back to releases. Select the last release that worked for you and test it again (just to make sure). If everything works, select your endpoint at the bottom of the page and click Deploy. Once the info message appears confirming deployment, you can start sending requests to the API endpoint, which will run the workflow you have just tested.

8. Test your endpoint using Postman

Copy the JSON from testing, copy the endpoint URL, fire up Postman, and get running.

In Postman, select the POST method and paste the endpoint URL into the request URL field. Click on Body, select raw, and then, further to the right, click on Text and select JSON. Paste the JSON with the email below and hit Send.

If everything goes right, you should receive the same response as when testing in the portal.

[Image: Results of testing the endpoint's REST API using Postman]
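The same call is easy to script outside Postman. A sketch with Python's requests; the endpoint URL below is a placeholder for the one you created in step 7:

import requests

ENDPOINT_URL = "https://your-endpoint.example/decision"  # placeholder URL
payload = {"email": "journalist@wsj.com"}  # same JSON as in testing

response = requests.post(ENDPOINT_URL, json=payload, timeout=10)
print(response.status_code)
print(response.json())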

9. Integrate into your system

If you have everything running right, integrate the endpoint into your system and enjoy being able to update rules and quickly deploy changes to the endpoint. The endpoint stays; the rules in the workflow can change as quickly as you want them to.

Happy decisioning!


Email address profiling

by Martin


Posted on October 01, 2020


Email profiling can work as a great tool for assessing the quality of a customer in many industries. In fintech or insurance, it can be predictive for credit risk management or actuarial machine learning models.

Each part of an email address is profiled differently. Whereas the username gives insights about choices, transparency, nicknames, and randomness, the domain (the part after the @) can provide a lot of useful information.

Profiling the username comes down to creating features that can be used in scoring and in detecting gibberish email addresses. Feature creation can focus on these:

  • First name present
  • Last name present
  • Number present
  • First name only
  • Levenshtein distance from first name
  • Levenshtein distance from last name

All of these are interesting indicators, but none can be used as a standalone good/bad flag – they enter predictive modeling together with other attributes.
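To make the list concrete, here is a minimal sketch of such feature creation; the first and last names are assumed to come from the application data:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def username_features(username: str, first: str, last: str) -> dict:
    u, f, l = username.lower(), first.lower(), last.lower()
    return {
        "first_name_present": f in u,
        "last_name_present": l in u,
        "number_present": any(c.isdigit() for c in u),
        "first_name_only": u == f,
        "levenshtein_first": levenshtein(u, f),
        "levenshtein_last": levenshtein(u, l),
    }

# Example: username_features("john.smith77", "John", "Smith")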

Domain name profiling comes down to a few things:

  • DNS information (MX records, domains)
  • Disposable/Junk/temporary mail detection
  • Email provider

Some of the rules for email address profiling based on the domain name are straightforward. If an email is not deliverable, there is no point in sending it. Deliverability can be assessed by checking WHOIS records and MX DNS records. MX records are like a post code – they say to which location the emails are physically sent. If there are none, the email cannot be delivered, no ifs or buts. WHOIS tells you whether the domain exists and who owns it. If the domain does not exist, there is no point in sending an email.
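A minimal deliverability check based on MX records might look like this sketch (using the third-party dnspython package; the WHOIS part is left out):

import dns.resolver  # third-party package: dnspython

def is_deliverable(domain: str) -> bool:
    # A domain without MX records cannot receive email.
    try:
        answers = dns.resolver.resolve(domain, "MX")
    except (dns.resolver.NXDOMAIN,        # domain does not exist
            dns.resolver.NoAnswer,        # exists, but has no MX records
            dns.resolver.NoNameservers):  # no nameserver could answer
        return False
    return len(list(answers)) > 0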

Then comes junk or temporary email detection. If someone provides you with a temporary mailbox like Guerrilla Mail, it is highly questionable whether they want any long-term relationship with you and your business. Services like that are popular with fraudsters because they are easy to use and you can get as many different addresses as you want.
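Detection itself is usually a simple lookup against a maintained blocklist; the two domains below are illustrative entries only, as real lists contain thousands:

# Illustrative entries only - real blocklists contain thousands of domains.
DISPOSABLE_DOMAINS = {"guerrillamail.com", "sharklasers.com"}

def is_disposable(email: str) -> bool:
    domain = email.rsplit("@", 1)[-1].lower()
    return domain in DISPOSABLE_DOMAINS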

Subsequently, there is email provider detection. There are differences between free email like Gmail and business email hosted on G Suite. You can score the identification of, for example:

  • Free email provider (Gmail, Hotmail, …)
  • Educational institution
  • Business Outlook
  • Business Google Suite
  • Generic webhosting email provider
  • Self-hosted Outlook
  • Other own email server

The type of service handling the email can tell you a lot. Whereas free email costs nothing, a cloud solution for companies costs more, and managing one's own Outlook server requires expensive licenses and IT support. As such, it can be a useful predictor for your models.
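A sketch of such provider detection based on the domain and the MX target; the suffixes and domains below are commonly published values, but treat the mapping as an assumption to verify rather than a complete table:

FREE_PROVIDERS = {"gmail.com", "hotmail.com", "yahoo.com"}  # examples only
MX_SUFFIXES = {
    "aspmx.l.google.com": "Business Google Suite",
    "mail.protection.outlook.com": "Business Outlook",
}

def classify_provider(email: str, mx_target: str) -> str:
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in FREE_PROVIDERS:
        return "Free email provider"
    target = mx_target.lower().rstrip(".")
    for suffix, provider in MX_SUFFIXES.items():
        if target.endswith(suffix):
            return provider
    return "Other own email server"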

Lastly, there is an interesting thing called a catch-all email address. Sometimes people buy a domain name and then redirect all incoming emails to their own mailbox. This can be negative or positive: negative if done by fraudsters who want to reuse as many domain names as possible in a short period; positive if someone has bought their own domain name but does not know how to manage a full mail system.
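Catch-all behavior can be probed by asking the mail server whether a random, almost certainly nonexistent address would be accepted. A rough sketch; note that many servers greylist or block such probes, so treat any result as a weak signal:

import smtplib
import uuid
import dns.resolver  # third-party package: dnspython

def looks_like_catch_all(domain: str) -> bool:
    # Find the highest-priority MX server for the domain.
    mx = sorted(dns.resolver.resolve(domain, "MX"),
                key=lambda r: r.preference)[0]
    host = str(mx.exchange).rstrip(".")
    probe = "{}@{}".format(uuid.uuid4().hex, domain)  # random local part
    with smtplib.SMTP(host, 25, timeout=10) as smtp:
        smtp.helo()
        smtp.mail("probe@example.com")
        code, _ = smtp.rcpt(probe)
    # Accepting a nonexistent address suggests a catch-all mailbox.
    return code == 250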


Face recognition and comparison for onboarding

by Martin


Posted on September 30, 2020


Face recognition has made big waves in recent years – waves driven by breakthrough improvements in deep learning. These changes are making remote identity verification a reality. Face comparison against identity documents, which used to be done manually by humans, is moving into the online realm.

Humans are good at recognizing and comparing faces thanks to brain areas dedicated to this task. In addition to face recognition, the brain performs more actions related to identifying people, ranging from analysis of clothes, gender, location, and context to analysis of walking and movement style.

Making artificial intelligence achieve comparable results on whole-person recognition would require combining multiple approaches, so for now let us focus only on face recognition. Doing face recognition in the online realm requires an approach built on validating input, running models, and making decisions based on the data collected.

Photos used for decision making in face recognition are of these types:

  1. Simple user-uploaded photo,
  2. Selfie taken at the time of recognition process,
  3. Photo taken during liveness detection process.

The photo of a person is subsequently matched against another source. That can be either a government service (such as NCIIC in China or Dukcapil in Indonesia) or a photo of an identity document (ID card, driving license, passport).

Data input validation

The major ways of validating the veracity of digital photos are:

  • Error level analysis,
  • EXIF metadata analysis,
  • Last saved quality.

Only two of those can be done well automatically without running into too many false positives. Error level analysis has the issue of needing someone to check the results, because it is more of a visual tool. It also fails to detect some photos manipulated in smart yet simple ways: for example, a screenshot of a manipulated photo will usually not be flagged as manipulated by ELA.

Metadata analysis provides several useful pieces of information, ranging from the camera used, timestamps, and the location of certain objects in the photo to, sometimes, even geolocation. This can be helpful when you want to make sure the photo was taken at the correct location (a point of sale?), not too far back (within minutes before the event being analyzed), and was not manipulated (Photoshop or other software is not present in the metadata). If the metadata are stripped, it should be a big warning. However, if metadata are missing on all photos, visit your software developers to find out how photos are taken, changed, and stored.
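As an illustration, a few of these metadata checks can be scripted with the Pillow imaging library; a sketch checking only the presence of metadata and the software tag:

from PIL import Image, ExifTags  # third-party package: Pillow

def metadata_red_flags(path: str) -> list:
    exif = Image.open(path).getexif()
    tags = {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}
    flags = []
    if not tags:
        flags.append("metadata stripped")  # a big warning on its own
    if "photoshop" in str(tags.get("Software", "")).lower():
        flags.append("editing software present in metadata")
    return flags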

Last saved quality is related to compression: stored and manipulated photos are usually compressed and no longer in the full quality produced by the device.

Most of the problems with receiving fake photos can be overcome by using liveness detection in the process. That usually requires a mobile app that runs liveness detection algorithms and, midway through the process, takes a photo that is sent for recognition. The remaining attack vector is the API that uploads the photo to the server. To decrease the risk, it is recommended to use not only standard hashing and encryption, but also additional hardening (even security-by-obscurity methods) to protect API endpoints against receiving fake data.
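One standard hardening measure is to have the app sign every upload so the server can reject forged or replayed requests. A minimal sketch using an HMAC over the payload (the shared secret is a placeholder that would be provisioned into the app):

import hashlib
import hmac
import json
import time

SECRET = b"shared-secret"  # placeholder - provisioned into the mobile app

def sign_upload(payload: dict) -> dict:
    # Canonical body + timestamp, signed so the server can verify
    # integrity and reject replays outside a short time window.
    body = json.dumps(payload, sort_keys=True)
    timestamp = str(int(time.time()))
    signature = hmac.new(SECRET, (timestamp + body).encode(),
                         hashlib.sha256).hexdigest()
    return {"body": body, "timestamp": timestamp, "signature": signature}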

Running the comparison

Once we are sure about the data we are receiving, the next step is to run the comparison. Government services usually perform well (a rule of thumb is that the more authoritarian the country, the better its face recognition algorithms); for everything else there are commercial services. Training your own model is these days almost irrational, given the amount of data necessary to train deep learning models and the advancement of the services, which are quite cheap.

Service selection can be done along several dimensions – price, speed, and quality of comparison. I would also recommend being selective based on the racial profile of the people being recognized/compared. For example, whereas Microsoft's services seem to perform quite well on Caucasians, they sometimes get absolutely lost when comparing Asian faces. For Asian faces, I have seen the best results from Face++; on Caucasians, it is sometimes off when it comes to detailed analysis of facial features.

I usually recommend two services for doing face recognition properly: one for analysis and one for comparison. Some people just run the comparison without running any analysis, but analysis can be used as a check on what is being compared – algorithms can sometimes be off the mark, saying someone is male even though the photo obviously shows a female.

Final decisioning

A good process for face recognition decision making is:

  1. Data validation - Incoming data can be trusted
  2. Outlier/strange result check - Using results from analysis for “trouble detection”
  3. Final decision - Comparison of confidence results

Recommended rules for incoming data validation are as follows:

  • No Photoshop or other editing software present in the metadata
  • Camera make matches the phone make (from other metadata, e.g. from the browser)
  • Geolocation is not out of place
  • Photo is not too old
  • Image metadata are present

Photo analysis

  • Gender match
  • Just one person detected in the photo

Comparison

  • The comparison result is not suspiciously high – a score like 99%+ can mean the same image was submitted twice
  • Confidence mark based on the recommendation of the service vendor, usually:
    • 80%+ high confidence of the same person
    • 60-80% some certainty
    • <60% not the same person
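Putting the three stages together, a minimal decision sketch; the field names are my illustration and the thresholds follow the generic vendor recommendation above:

def decide(validation_ok: bool, analysis: dict, confidence: float) -> str:
    # Stage 1: incoming data must be trustworthy.
    if not validation_ok:
        return "reject - untrusted input"
    # Stage 2: outlier checks using the analysis service results.
    if analysis.get("faces_detected") != 1 or not analysis.get("gender_match"):
        return "refer - analysis outlier"
    if confidence >= 99.0:
        return "refer - suspiciously high score"
    # Stage 3: confidence thresholds.
    if confidence >= 80.0:
        return "accept - high confidence of the same person"
    if confidence >= 60.0:
        return "refer - some certainty"
    return "reject - not the same person"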

Final words…

The goal of this post is to explain in broad strokes the process of face recognition. I do not claim it is comprehensive; if anything, it should be the beginning, not the end, of a policy/strategy for decision making in identity verification. The problem of human identification is a multifaceted one. Doing face recognition and comparison while focusing only on running some cognitive services and setting a cutoff can lead to stupid decisions made in the belief that one has great data.