# Python - Cheat Sheet

```python
df.to_csv(filename) # Writes to a CSV file
df.to_excel(filename) # Writes to an Excel file
df.to_sql(table_name, connection_object) # Writes to a SQL table
df.to_json(filename) # Writes to a file in JSON format
df.to_html(filename) # Saves as an HTML table
df.to_clipboard() # Writes to the clipboard
```

### Defining, then Coding and Testing Immediately

In reality, it is often more practical to define a cleaning operation, then immediately code and test it. The data wrangling template still applies here, except you'll have multiple **Define**, **Code**, and **Test** subheadings, with third-level headers (###) denoting each issue, as displayed below.

![](https://846345873-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LagOeJ2nL90MQERwhxy%2F-Lh6G-cGy9Xxt-oDrzA3%2F-Lh6GHxxECEizM33BxOA%2Fimage.png?alt=media\&token=3c68c102-89a0-4860-9140-0dd92c8183c0)

## Saving and writing

Remember to set `index=False` to avoid saving the index as an extra unnamed column!
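For example, a minimal round trip (filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# without index=False, the index would be saved as an extra unnamed column
df.to_csv('data.csv', index=False)

df_back = pd.read_csv('data.csv')
print(df_back.columns.tolist())  # ['a', 'b'] — no 'Unnamed: 0'
```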

## Selecting

```python
df[col] # Returns column with label col as a Series
df[[col1, col2]] # Returns columns as a new DataFrame
s.iloc[0] # Selection by position (selects first element)
s.loc[0] # Selection by index label (selects the element labeled 0)
df.iloc[0,:] # First row
df.iloc[0,0] # First element of first column
```

We can select data using `loc` and `iloc`, which you can read more about [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html). `loc` uses labels of rows or columns to select data, while `iloc` uses the index numbers. We'll use these to index the dataframe below.

```python
# select all the columns from 'id' to the last mean column, which is called
# fractal_dimension_mean (there are other columns after that)
df_means = df.loc[:,'id':'fractal_dimension_mean']
df_means.head()

# Using index numbers
df_means = df.iloc[:,:12]
```

Another useful function that we’re going to use is [pandas' query](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html) function.

In the previous lesson, we selected rows in a dataframe by indexing with a mask. Here are those same examples, along with equivalent statements that use `query()`.

```python
# selecting malignant records in cancer data
df_m = df[df['diagnosis'] == 'M']
df_m = df.query('diagnosis == "M"')

# selecting records of people making over $50K
df_a = df[df['income'] == ' >50K']
df_a = df.query('income == " >50K"')
```

The examples above filtered columns containing strings. You can also use `query` to filter columns containing numeric data, like this.

```python
# selecting records in cancer data with radius greater than the median
df_h = df[df['radius'] > 13.375]
df_h = df.query('radius > 13.375')
```
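A self-contained version of the filters above, on a small made-up frame, showing that the mask and `query()` forms select the same rows:

```python
import pandas as pd

df = pd.DataFrame({'diagnosis': ['M', 'B', 'M'],
                   'radius': [17.99, 9.5, 20.57]})

# boolean mask and query() return the same rows
df_m_mask = df[df['diagnosis'] == 'M']
df_m_query = df.query('diagnosis == "M"')

# numeric comparison works the same way
df_h = df.query('radius > 13.375')
print(len(df_m_query), len(df_h))  # 2 2
```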

### Groupby
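This section has no examples yet; a minimal `groupby` sketch on made-up diagnosis data might look like:

```python
import pandas as pd

df = pd.DataFrame({'diagnosis': ['M', 'B', 'M', 'B'],
                   'radius': [18.0, 10.0, 20.0, 12.0]})

# mean radius per diagnosis group
means = df.groupby('diagnosis')['radius'].mean()
print(means['M'])  # 19.0
```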

## Describing

```python
df.shape # Number of rows and columns in the DataFrame (an attribute, not a method)
df.head(n) # Returns first n rows of the DataFrame
df.tail(n) # Returns last n rows of the DataFrame
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
s.value_counts(dropna=False) # Views unique values and counts
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
df.mean() # Returns the mean of all columns
df.corr() # Returns the correlation between columns in a DataFrame
df.count() # Returns the number of non-null values in each DataFrame column
df.max() # Returns the highest value in each column
df.min() # Returns the lowest value in each column
df.median() # Returns the median of each column
df.std() # Returns the standard deviation of each column

```

Useful exploratory methods:

```python
df_red.duplicated().sum()
df_red.isnull().sum()
```
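For instance, on a small frame with one duplicate row and one missing value (toy data):

```python
import pandas as pd

df_red = pd.DataFrame({'quality': [5, 5, 6],
                       'pH': [3.2, 3.2, None]})

print(df_red.duplicated().sum())    # 1  (second row repeats the first)
print(df_red.isnull().sum()['pH'])  # 1  (one missing pH value)
```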

## Cleaning

```python
# Data cleaning

df.columns = ['a','b','c'] # Renames columns
pd.isnull(obj) # Checks for null values, returns a Boolean array
pd.notnull(obj) # Opposite of pd.isnull(obj)
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(axis=1,thresh=n) # Drops all columns that have fewer than n non-null values
df.fillna(x) # Replaces all null values with x
s.fillna(s.mean()) # Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
s.astype(float) # Converts the datatype of the series to float
s.replace(1,'one') # Replaces all values equal to 1 with 'one'
s.replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) # Mass renaming of columns
df.rename(columns = lambda x: x[:10] + "_2008", inplace=True) # Other example
df.rename(columns={'old_name': 'new_name'}, inplace=True) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index

pd.to_datetime(s) # Converts a series of date strings to datetime
df.drop_duplicates(inplace=True) # Drops duplicate rows in place

# Example: drop columns from 2008 dataset
df_08.drop(['Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'], axis=1, inplace=True)

# Example: replace spaces with underscores and lowercase labels for 2008 dataset
# strip() returns a copy of the string with leading and trailing whitespace removed
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

# Check that none of these values remain in our column:
for phrase in asap_list:
    assert phrase not in df_clean.StartDate.values


```

```python
# confirm column labels for 2008 and 2018 datasets are identical
df_08.columns == df_18.columns

# make sure they're all identical like this
(df_08.columns == df_18.columns).all()

# return duplicated rows
df2[df2.duplicated(['user_id'], keep=False)]
```

### Fixing data types

```python
# convert 2018 cyl column to int
df_18['cyl'] = df_18['cyl'].astype(int)
```

#### Regular expressions

```python
# Extract int from strings in the 2008 cyl column
df_08['cyl'] = df_08['cyl'].str.extract(r'(\d+)', expand=False).astype(int)
```
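A self-contained illustration of the same idea, with made-up values in the style of the `cyl` column:

```python
import pandas as pd

s = pd.Series(['(6 cyl)', '(4 cyl)', '(8 cyl)'])

# pull the first run of digits out of each string, then cast to int
cyl = s.str.extract(r'(\d+)', expand=False).astype(int)
print(cyl.tolist())  # [6, 4, 8]
```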

{% file src="<https://846345873-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LagOeJ2nL90MQERwhxy%2F-LfXCAJ4RL8mM8y41pm-%2F-LfXCUSVQ2ubCLP6g3lm%2Ffix_datatypes_air_pollution.html?alt=media&token=85600357-0f0e-457c-91a5-4443b5c82bef>" %}
Splitting entries into 2 columns
{% endfile %}
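The attached notebook covers the splitting step; the core idea with `str.split` (hypothetical column names) looks like:

```python
import pandas as pd

df = pd.DataFrame({'range': ['10-20', '30-40']})

# expand=True spreads the split pieces into separate columns
df[['low', 'high']] = df['range'].str.split('-', expand=True)
print(df['low'].tolist())  # ['10', '30']
```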

### Renaming Columns

The following helps us determine duplicated columns in different tables

![](https://846345873-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LagOeJ2nL90MQERwhxy%2F-LhDCdx6ZyQzpJ6b6V_5%2F-LhDhL0oPbBYjSRir412%2Fimage.png?alt=media\&token=593f9d41-7d72-4a8b-9eeb-dde78156e487)

Helpful script

```python
# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if '_mean' in col:
        new_labels.append(col[:-5])  # exclude last 5 characters ("_mean")
    else:
        new_labels.append(col)

# new labels for our columns
new_labels
```
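A minimal, self-contained run of the loop above, with toy column names assumed:

```python
import pandas as pd

df = pd.DataFrame(columns=['radius_mean', 'texture_mean', 'diagnosis'])

# remove "_mean" from column names ("_mean" is 5 characters)
new_labels = []
for col in df.columns:
    if '_mean' in col:
        new_labels.append(col[:-5])
    else:
        new_labels.append(col)

df.columns = new_labels
print(new_labels)  # ['radius', 'texture', 'diagnosis']
```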

### Tidiness

```python
import json
import pandas as pd

# download each tweet's JSON via the Twitter API and save it to file,
# one JSON object per line (assumes an authenticated tweepy `api` object
# and a `tweet_ids` list)
with open('tweet_json.txt', 'w') as f:
    for t_id in tweet_ids[0:9]:
        tweet = api.get_status(t_id, tweet_mode='extended')
        json.dump(tweet._json, f)
        f.write('\n')

# read the file line by line and build a dataframe of counts
df_list = []
with open('tweet_json.txt') as f:
    for line in f:
        status = json.loads(line)
        df_list.append({'tweet_id': status['id'],
                        'retweet_count': status['retweet_count'],
                        'favorite_count': status['favorite_count']})
df_3 = pd.DataFrame(df_list,
                    columns=['tweet_id', 'retweet_count', 'favorite_count'])
```
