As a data scientist, you need a toolbox of essential tools for data analysis and model development. Because Python is so popular in data science, a growing number of analysts choose Python-based software as their primary analytics tools. In this article, I introduce a popular toolbox for basic data analysis and Python development tasks and provide a step-by-step installation tutorial. The toolbox consists of Anaconda for module (package) management, Jupyter Notebook for interactive data analysis, and PyCharm as a typical IDE (with an outstanding debugger) for Python development. This article is based on the Windows operating system; for a Linux tutorial, please refer to the next article, Data scientist essential tools setup(Clusters).
1. Install Anaconda
1.1 Visit the Anaconda website and download the Windows installer, as shown in the following picture. Download the Python 3.7 version rather than the Python 2.7 version, and select the 64-Bit Installer if your system is 64-bit.
1.2 Install Anaconda on your machine step by step.
1.3 Once the installation is complete, look at the desktop. If there is a green circle icon for Anaconda Navigator, double-click it; otherwise, type ‘anaconda navigator’ in the Windows search bar to find the Anaconda Navigator icon.
1.4 Open Anaconda Navigator and create a virtual environment for your first project. Do not use the base (root) environment, because different projects may need different versions of the same modules. Create a separate virtual environment for each project. The process of creating a virtual environment in Anaconda Navigator is shown in the following picture.
step1. Select Environments
step2. Select the Create icon at the bottom left
step3. Type in the name of the virtual environment
step4. Select Python 3.7
step5. Click Create
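If you prefer the command line, the same environment can be created with conda from Anaconda Prompt. The sketch below only builds and prints the command rather than running it. The helper name build_create_command is hypothetical, and ‘new_environment’ is simply the example name used later in this article.

```python
def build_create_command(env_name, python_version="3.7"):
    """Return the conda command that creates a named virtual environment."""
    # --name sets the environment name; --yes skips the confirmation prompt.
    return ["conda", "create", "--name", env_name,
            f"python={python_version}", "--yes"]


# Print the command so it can be pasted into Anaconda Prompt.
print(" ".join(build_create_command("new_environment")))
```

Running the printed command in Anaconda Prompt should have the same effect as steps 1 to 5 above.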
1.5 Install essential modules (packages) in your virtual environment. A new virtual environment comes with only a few modules, which are not sufficient for data analysis. You can download and install more as shown in the following picture. First make sure you have selected the new virtual environment by clicking its name.
step1. Select the filter that lists all not-yet-installed packages
step2. Type the package’s name (I type pandas here)
step3. Check the small box next to the package you want to download
step4. Right-click the package and select the last item on the menu (Mark for specific version)
step5. Choose the version you need
step6. Click Apply
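The same installation can be done from Anaconda Prompt with conda install. As a minimal sketch, the hypothetical helper below builds the command for a package with an optional pinned version. The version number in the example is only a placeholder, not a recommendation.

```python
def build_install_command(env_name, package, version=None):
    """Return the conda command that installs a package into an environment."""
    # "package=version" pins a specific version; a bare name takes the default.
    spec = package if version is None else f"{package}={version}"
    return ["conda", "install", "--name", env_name, spec, "--yes"]


# Placeholder version; replace it with the version your project needs.
print(" ".join(build_install_command("new_environment", "pandas", "1.0.5")))
```

Leaving out the version lets conda pick one that is compatible with the other packages in the environment.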
1.6 Now that the virtual environment is set up, let’s start programming with Jupyter Notebook. In short, Jupyter Notebook is an interactive environment that is very helpful for data science projects.
step1. Click Home
step2. Choose the virtual environment you need
step3. If this is the first time you use Jupyter in this environment, install it first by clicking Install in Anaconda Navigator
step4. Once Jupyter is installed, click the Launch button to open Jupyter Notebook (this step is not shown below)
1.7 Open Jupyter Notebook and create a new project.
step1. The main page of Jupyter Notebook shows the folder structure. Navigate to the folder where you want to put your new notebook
step2. Click the New button in the top right corner and select Python 3
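A useful first cell in the new notebook is a quick check that the kernel really runs inside your virtual environment. The sketch below uses only the standard sys module. The function name describe_interpreter is a hypothetical helper, and the exact paths printed depend on where Anaconda is installed on your machine.

```python
import sys


def describe_interpreter():
    """Collect basic facts about the Python interpreter running this code."""
    return {
        "executable": sys.executable,   # full path of the running interpreter
        "environment": sys.prefix,      # root folder of the active environment
        "python_version": "{}.{}".format(*sys.version_info[:2]),
    }


for key, value in describe_interpreter().items():
    print(key, "->", value)
```

If the environment folder shown is not the one you created in Anaconda Navigator, the notebook was probably launched from a different environment.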
2. Install PyCharm
2.1 Visit the PyCharm download page and download the Community version.
2.2 Install PyCharm as shown below.
2.3 Open PyCharm by clicking the PyCharm icon on the desktop and set up the environment for PyCharm as follows.
2.4 Select the interpreter for PyCharm (Important!)
If you want to link your PyCharm project to a virtual environment you have already created in Anaconda Navigator (the most common scenario in daily work), follow these steps.
step1. Name your project and choose the place to save your project
step2. Select Existing interpreter
step3. Click the button to choose the existing interpreter
step4. Select Virtualenv Environment
step5. Click the button to browse for an existing virtual environment interpreter
step6-11. The window in the right picture displays the file/folder tree, where you can find the target virtual environment and its interpreter. For example, the virtual environment created above in Anaconda Navigator, named ‘new_environment’, is under the path ‘D:/anaconda3/envs/’, and the Python interpreter ‘python.exe’ for this virtual environment is under the path ‘D:/anaconda3/envs/new_environment’.
step12. Click the OK button (no longer grayed out) to open the HelloWorld project in PyCharm
step13-16. You have now opened the HelloWorld project in PyCharm with the new_environment interpreter. Next, create a Python file in the project as shown in the following steps
step17. Test whether the pandas package is available. It should be, because we already downloaded and installed pandas in new_environment. If the new_environment interpreter has been loaded successfully, importing pandas should work without errors.
step18. Click the Run button
step19. If you get a message ‘Process finished with exit code 0’, you are all set!
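The test file from step17 could look like the following sketch. It prints the interpreter path PyCharm is expected to use next to the one actually running, then reports whether pandas can be found. The helper names are hypothetical, and the root folder ‘D:/anaconda3’ is the example location from this article, so adjust it to your own installation.

```python
import importlib.util
import sys
from pathlib import PureWindowsPath


def expected_interpreter(anaconda_root, env_name):
    """Build the python.exe path of a virtual environment under Anaconda."""
    return PureWindowsPath(anaconda_root) / "envs" / env_name / "python.exe"


def is_importable(module_name):
    """Return True if the running interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


print("Expected interpreter:", expected_interpreter("D:/anaconda3", "new_environment"))
print("Running interpreter: ", sys.executable)
print("pandas importable:   ", is_importable("pandas"))
```

If the two interpreter paths differ or pandas is not importable, go back to step2 and check which interpreter the project uses.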
Jupyter Notebook is a good interactive environment but not a good debugging tool. When your project involves a lot of coding and extensive OOP work, PyCharm is a better option. In conclusion, Anaconda + Jupyter Notebook + PyCharm is what you need for data analysis!
3.1. In some cases, you can’t find the module (package) you want to use in Anaconda. What should you do?
Ans: When you cannot find the module (package) in Anaconda Navigator or Anaconda Cloud, look for it on PyPI instead. To understand the difference between Conda and PyPI, please check this website. Once you have found the module on PyPI, how do you download it and install it in your virtual environment? The solution is simple: use pip to download and install the module into the folder of the virtual environment. You can easily do this with the following steps:
step1. Open Windows Command Prompt by typing cmd in the Windows search bar, or open Anaconda Prompt by typing Anaconda Prompt in the Windows search bar
step2. Change into the Scripts folder under the virtual environment (the pip command for this environment is in its Scripts folder, so changing into that folder first makes sure you run this environment’s pip)
step3. Download and install the module with the following command (taking the module ‘mesa’ as an example)
pip install mesa
step4. (optional) You can use Anaconda Navigator to check whether the module ‘mesa’ has been installed in the virtual environment ‘new_environment’. The left picture below is a screenshot of all modules whose names are related to ‘mesa’ before installing ‘mesa’ with pip. All of these modules come from the Anaconda repository, because the icons in the red rectangle are all green circles (the Anaconda icon), and none of them exactly matches the name ‘mesa’, which means Anaconda has no such module. That is why we have to use pip to download ‘mesa’ from PyPI. The right picture below is a screenshot taken after installing ‘mesa’ with pip: there is an extra module named exactly ‘mesa’ with a green checkbox, which means the module has been installed. Also notice that the icon right after the module ‘mesa’ is different, because it is the PyPI logo.
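An alternative that avoids changing into the Scripts folder is to ask the environment’s own interpreter to run pip, for example from a Jupyter Notebook cell or a PyCharm script that already uses new_environment. The sketch below is one way to do this with the standard library. The helper names are hypothetical, and by default it only prints the command (a dry run) instead of installing anything.

```python
import subprocess
import sys


def build_pip_command(package):
    """Return a pip install command bound to the running interpreter."""
    # "python -m pip" installs into the environment of that interpreter.
    return [sys.executable, "-m", "pip", "install", package]


def install_with_pip(package, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    command = build_pip_command(package)
    if dry_run:
        print("Would run:", " ".join(command))
        return command
    subprocess.check_call(command)
    return command


install_with_pip("mesa")  # dry run: prints the command only
```

Calling install_with_pip("mesa", dry_run=False) would actually download and install the module into whichever environment is running the code.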