There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir.
There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France.

df[...].head()

You can plot the number of wines by country using the plot method of a pandas DataFrame together with Matplotlib. If you are not familiar with Matplotlib, I suggest taking a quick look at this tutorial.

# country is assumed to be the dataframe grouped by country, e.g. df.groupby("country")
plt.figure(figsize=(15,10))
country.size().sort_values(ascending=False).plot.bar()

Among the 44 countries producing wine, the US has more than 50,000 types of wine in the wine review dataset, twice as many as the next country in the ranking: France, the country famous for its wine. Italy also produces a lot of quality wine, with nearly 20,000 wines open to review.

Let's now take a look at the plot of all 44 countries by their highest-rated wine, using the same plotting technique as above:

plt.figure(figsize=(15,10))
country.max().sort_values(by="points",ascending=False).plot.bar()

Australia, US, Portugal, Italy, and France all have 100-point wines. Notice that Portugal ranks 5th and Australia ranks 9th in the number of wines produced in the dataset, and both countries have fewer than 8,000 types of wine.

That's a little bit of data exploration to get to know the dataset that you are using today. Now you will start to dive into the main course of the meal: WordCloud.

WordCloud is a technique to show which words are the most frequent in a given text.
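The two bar charts discussed above rely on a grouped object named country whose definition does not appear in this text. A minimal sketch, assuming country is df.groupby("country") and using a few made-up rows in place of the Kaggle file, might look like the following. It selects the points column before taking the maximum, which is a deliberate variation on the country.max() call so that only the numeric rating is aggregated.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; the real dataset has about 130k reviews.
df = pd.DataFrame(
    {
        "country": ["US", "US", "US", "France", "France", "Italy"],
        "points": [100, 88, 90, 97, 85, 92],
    }
)

# Assumption: `country` in the tutorial is a groupby over the country column
country = df.groupby("country")

# Number of wines per country, largest first
wine_counts = country.size().sort_values(ascending=False)

# Highest rating per country, largest first
top_points = country["points"].max().sort_values(ascending=False)

# First chart: how many wines each country has in the data
plt.figure(figsize=(15, 10))
ax_counts = wine_counts.plot.bar()
ax_counts.set_ylabel("Number of wines")

# Second chart: the best score reached by each country
plt.figure(figsize=(15, 10))
ax_points = top_points.plot.bar()
ax_points.set_ylabel("Highest points")
```

In a notebook the figures render inline; in a plain script you would typically finish with plt.show() or plt.savefig().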
Now, use pandas read_csv to load in the dataframe. Notice the use of index_col=0, meaning we don't read in the row name (index) as a separate column.

# Load in the dataframe
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

You can print out some basic information about the dataset using print() combined with string formatting.

There are 129971 observations and 13 features in this dataset.
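As a minimal sketch of the loading and summary step, the snippet below reads a CSV with index_col=0 and builds the summary lines with print() and str.format(). To stay runnable without the Kaggle file, it reads a tiny in-memory sample whose rows are made up for illustration. The column names country and variety are assumptions based on the summaries quoted in this tutorial, and summarize is a hypothetical helper, not part of pandas.

```python
import io

import pandas as pd

# Made-up sample rows standing in for data/winemag-data-130k-v2.csv.
# The first (unnamed) column plays the role of the row index.
SAMPLE_CSV = """,country,description,points,variety
0,Italy,placeholder review text,87,White Blend
1,Portugal,placeholder review text,87,Portuguese Red
2,US,placeholder review text,87,Pinot Gris
3,US,placeholder review text,87,Riesling
"""


def summarize(df):
    """Return a few one-line summaries of the dataframe (hypothetical helper)."""
    return [
        "There are {} observations and {} features in this dataset.".format(
            df.shape[0], df.shape[1]
        ),
        "There are {} types of wine in this dataset.".format(
            df["variety"].nunique()
        ),
        "There are {} countries producing wine in this dataset.".format(
            df["country"].nunique()
        ),
    ]


# index_col=0 uses the first column as the index instead of a data column
df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)

for line in summarize(df):
    print(line)
```

With the real file, the only change would be passing the CSV path to pd.read_csv instead of the in-memory sample.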
This collection is a great dataset for learning with no missing values (which would take time to handle) and a lot of text (wine reviews), categorical, and numerical data.

Now let's get started!

First things first, load all the necessary libraries:

# Start with loading all necessary libraries
# (pandas and matplotlib are needed by the loading and plotting code in this tutorial)
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

If you have more than 10 libraries, organizing them by section (such as basic libs, visualization, models, etc.) using comments will make your code clean and easy to follow.
However, the latest version with the ability to mask the cloud into any shape of your choice requires a different method of installation as below:

git clone

This tutorial uses the wine review dataset from Kaggle.
Pillow is a fork of PIL, the Python Imaging Library. You will need this library to read in an image as the mask for the wordcloud.

Wordcloud can be a little tricky to install. If you only need it for plotting a basic wordcloud, then pip install wordcloud or conda install -c conda-forge wordcloud would be sufficient.
You will need to install the packages described below.

The numpy library is one of the most popular and helpful libraries for handling multi-dimensional arrays and matrices. It is also used in combination with the pandas library to perform data analysis. The Python os module is a built-in library, so you don't have to install it. To read more about handling files with the os module, this DataCamp tutorial will be helpful.

For visualization, matplotlib is a basic library that many other libraries build on to run and plot, including seaborn and wordcloud, which you will use in this tutorial. The pillow library is a package that enables image reading.
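Before going further, it can help to confirm that the packages named above are actually importable in your environment. The sketch below is one way to do that with the standard library only. The helper name missing_packages is hypothetical, and the list of import names is an assumption based on the libraries this tutorial mentions (pillow is imported as PIL).

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names if importlib.util.find_spec(name) is None
    ]


# Import names of the packages this tutorial relies on; pillow is imported as PIL
required = ["numpy", "pandas", "matplotlib", "PIL", "wordcloud"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything reported as missing can then be installed with pip or conda before you continue.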
Many times you might have seen a cloud filled with lots of words in different sizes, which represent the frequency or the importance of each word. In this tutorial, you will learn how to create a WordCloud of your own in Python and customize it as you see fit. This tool is quite handy for exploring text data and making your reports more lively.

In this tutorial we will use a wine review dataset taken from the Wine Enthusiast website to learn:

How to create a basic wordcloud from one to several text documents.