
The importance of having a good dataset

… and how to actually get a good dataset?

In this post, I will discuss the significance of datasets, and how to make a good textual dataset.

What is data?

Data is, to put it simply, another word for information. But in the context of computing and business, data refers to information that is machine-readable as opposed to human-readable.

Human-readable vs Machine-readable

Machine-readable data, or computer-readable data, is data (or metadata) in a format that can be easily processed by a computer, while a human-readable medium or human-readable format is a representation of data or information that can be naturally read by humans. Machine-readable data can be automatically transformed for human readability but, generally speaking, the reverse is not true. An example would be a PDF document, which is generally easy for us to understand but typically difficult for machines to interpret.

Datasets in machine learning

In my humble experience working as an AI Developer, one of the biggest problems I encountered, besides actually building a model for a specific problem, was having a good dataset that properly relates to the problem at hand. On top of that, the dataset has to be processed in a way that lets our model make sense of the information, so that it can successfully learn from it.

Quality > Quantity

It is of great importance that the data is right for the problem we want to solve. It doesn’t matter if we have terabytes of data if the data isn’t aligned with the problem. We are trying to find data with the features that matter to what we’re trying to classify or predict and discard unrelated features. The first step should be proper data collection, and until we get this right, we will find ourselves constantly coming back to it.


// keep collecting until the dataset fits the problem
while ( !dataset.isRight() )
{
	dataset.collect( new Data("some magical place") );
}

The dataset should have all of the useful features stand out. For example, if we were to build a model whose job is to detect where a person is in an image, our dataset should consist of images containing people whose exact location in the image is known.
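
As an illustration, a single labeled example for such a task might be represented along these lines (a hypothetical sketch in Python; the field names are made up for illustration):

# One hypothetical labeled sample: an image plus the known location of the person in it.
sample = {
    "image_path": "images/person_001.jpg",  # illustrative path, not a real file
    "person_box": {"x": 34, "y": 52, "width": 120, "height": 260},  # bounding box in pixels
}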

After successfully collecting data, it should be converted into a format that our model understands. In general, the input data, whether it’s text, images, videos, or sounds, is turned into vectors and tensors to which linear algebra operations can be applied. The data also needs to be normalized, standardized, and cleaned to increase its effectiveness. Until then, the data is referred to as raw (unprocessed) data.
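
As a minimal sketch of what “turning text into vectors” can look like, here is one common technique, a bag-of-words count vectorizer from scikit-learn (this assumes scikit-learn is installed and is only one of many possible approaches):

from sklearn.feature_extraction.text import CountVectorizer

# A tiny raw text corpus, for illustration only.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy dog sleeps all day.",
]

# Each document becomes a vector of word counts over a shared vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (older scikit-learn: get_feature_names())
print(X.toarray())                         # one count vector per document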

Collecting data

There are two ways to gather data:

  1. Collecting it from a primary source yourself, where you are the first one to obtain the data
  2. Collecting data from a secondary source, where you re-use data that has already been collected by others, such as data published as part of scientific research (see the sketch below)
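
For instance, a secondary-source dataset can often be loaded with a few lines of code. The sketch below uses scikit-learn’s bundled 20 Newsgroups text dataset, assuming scikit-learn is installed; any public dataset would do:

from sklearn.datasets import fetch_20newsgroups

# Download (or load from a local cache) a well-known public text dataset.
newsgroups = fetch_20newsgroups(subset="train")

print(len(newsgroups.data))      # number of raw text documents
print(newsgroups.data[0][:200])  # a peek at the first document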

Preprocessing raw text data

In this section, I am going to describe the usual preprocessing steps for preparing generic textual data. These steps make it possible for us to get the most useful information out of a text dataset.

Tokenization

Tokenization splits longer strings into smaller ones, called tokens. We tokenize the text into sentences, and the sentences are then tokenized into words. Tokenization is usually the first step in preprocessing text. This process is also referred to as text segmentation or lexical analysis. Sounds easy, right? It isn’t as straightforward as it seems. Let’s take sentence tokenization as an example. The first thing that comes to my mind is to split the text by punctuation marks. It could work for some examples:

The quick brown fox jumps over the lazy dog.

But we couldn’t say the same for the following case:

Please arrive by 12:30 p.m. sharp.

What about word tokenization? If we only considered splitting words by whitespace, instances like she’s wouldn’t be tokenized correctly. This apostrophe brings up an interesting question: how should we approach it? Is it actually important to keep track of where punctuation is placed, or should we drop it altogether?
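
To make this concrete, here is a minimal sketch using NLTK’s tokenizers (assuming NLTK and its pretrained “punkt” models are available; other libraries such as spaCy would work just as well):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # pretrained sentence tokenizer (newer NLTK releases may ask for "punkt_tab")

text = "Please arrive by 12:30 p.m. sharp. She's already on her way."

# Sentence tokenization: the pretrained model typically recognizes that "p.m." does not end a sentence.
print(sent_tokenize(text))

# Word tokenization: clitics like "She's" are split into "She" and "'s".
print(word_tokenize(text))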

Removing stop words

When processing natural language, we often filter out stop words. These lists usually consist of the most common words in a given language, such as ‘the’ in English. It is important to note that there isn’t a universal list of these words; different tools may even use different stop word lists. Stop words are usually deemed irrelevant and are often dropped from the text, which helps us ‘clean up’ the data.
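
A short sketch of stop word removal using NLTK’s English stop word list, which is just one list among many (assuming the “stopwords” corpus has been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")

# Keep only tokens that are not in the stop word list (case-insensitive).
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # e.g. ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']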

Stemming and lemmatization

Because of grammar, we will often find different forms of a word, like organize, organizes, and organizing. It is often useful to reduce the number of different forms of the same word by replacing them with a common base form. This is achieved by both stemming and lemmatization of the words. By doing this we make the data more consistent.

For example, this is the result we would want to achieve for these words:

 

am, are, is → be

cat, cats, cat’s, cats’ → cat

 

So, how do we exactly achieve this? Stemming is a process that usually crudely chops off the end of the word, hoping to achieve the desired goal most of the time. Lemmatization however usually refers to doing this properly, with the help of vocabulary and morphological analysis of words, aiming to return the base of the word, known as a lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

There are many algorithms, both commercial and open-source, which do exactly this. Stemmers use language-specific rules and are easier to implement than a lemmatizer, which requires a complete vocabulary and morphological analysis to correctly lemmatize words. Some tokens may require special rules to be implemented as well.
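
As a small illustration, NLTK ships both a Porter stemmer and a WordNet-based lemmatizer (assuming the “wordnet” data has been downloaded; the outputs in the comments are indicative):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # some NLTK versions also need "omw-1.4"

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off suffixes with rules, without consulting a vocabulary.
print(stemmer.stem("organizing"))  # e.g. 'organ'
print(stemmer.stem("cats"))        # 'cat'

# Lemmatization looks up the base form, taking the part of speech into account.
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'
print(lemmatizer.lemmatize("cats", pos="n"))  # 'cat'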

For obvious reasons, this process differs from language to language.

Data normalization

Normally, raw data consists of attributes with varying scales. Data normalization is the process of rescaling one or more attributes to the range of 0 to 1. In general, it is a good technique to use when you don’t know the distribution of the data, or when you know the distribution is not Gaussian.
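
A minimal sketch of min-max rescaling with NumPy (for a single numeric attribute; scikit-learn’s MinMaxScaler does the same per column):

import numpy as np

# A toy attribute with values on an arbitrary scale.
values = np.array([3.0, 10.0, 55.0, 120.0])

# Rescale to the range [0, 1]: subtract the minimum, divide by the range.
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # approximately [0.0, 0.06, 0.44, 1.0]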

Further noise removal

Depending on where you get the data from, you will have to take additional steps to clean it up. Let’s take tweets from Twitter as an example. ‘Noise’ in this case would be links, user mentions, or ‘RT: ’.
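
A rough sketch of this kind of cleanup using plain regular expressions (the patterns are illustrative rather than exhaustive):

import re

tweet = "RT: Check this out https://example.com/post @some_user #ml"

text = re.sub(r"^RT\s*:?\s*", "", tweet)  # drop the retweet marker
text = re.sub(r"https?://\S+", "", text)  # drop links
text = re.sub(r"@\w+", "", text)          # drop user mentions
text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(text)  # 'Check this out #ml'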

Instances where machine learning models were ‘corrupted’ by bad data, or a lack of it

Here are some popular examples where models were led astray by bad data.

Google Photos labeled black people ‘gorillas’ [source]

Google’s automatic image labeling system wrongly classified African American people as “gorillas.”

Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day [source]

While engaging with other users, the chatbot repeated a lot of problematic sentiment from toxic users.

Researchers Fooled a Google AI Into Thinking a Rifle Was a Helicopter [source]

Researchers intentionally exploited a specific weakness of the algorithm, known as an “adversarial example,” to trick the AI.

Conclusion

When building a dataset, it’s important to know what features to look for, so that the model can get the most benefit out of it. After acquiring the data, we need to remove noise so that the useful features stand out.

About the author:

Naida Agić is a software developer at BPUE, working on the AIML team. She is passionate about machine learning, responsible and dedicated, goal-oriented, and fearless in problem solving. One of her biggest passions is mathematics, and she is one of the talents at the Faculty of Natural Sciences and Mathematics in Sarajevo.
