Importação/Manipulação de Dados com Python

class: center, middle, inverse, title-slide

# Importação/Manipulação de Dados com Python
## Parte 2

---

## Lendo CSV em Python, pandas

Primeiramente, `import` evoca pacotes e tem função similar a `library()` e/ou `require()` no `R`.

```python
import numpy as np #para matrizes e arrays
import matplotlib.pyplot as plt 
import pandas as pd #para dataframes
print(pd.__version__) # Checks pandas version, we need 0.18 at least
```

```
## 1.1.4
```

---

## Baby names

Dados são da SSA (Social Security Agency), mas eu só consegui baixá-los de https://github.com/hadley/data-baby-names/blob/master/baby-names.csv.

```r
#R
babyNamesR = readr::read_csv("../dados/baby-names.csv") %>% as.data.frame()
babyNamesR %>% head(n=3)
```

```
##   year    name  percent sex
## 1 1880    John 0.081541 boy
## 2 1880 William 0.080511 boy
## 3 1880   James 0.050057 boy
```

```python
#python
babyNamesPY = pd.read_csv("../dados/baby-names.csv", header = 0)
babyNamesPY.head()
```

```
##    year     name   percent  sex
## 0  1880     John  0.081541  boy
## 1  1880  William  0.080511  boy
## 2  1880    James  0.050057  boy
## 3  1880  Charles  0.045167  boy
## 4  1880   George  0.043292  boy
```

---

### Em R

```r
class(babyNamesR)
```

```
## [1] "data.frame"
```

```r
class(py$babyNamesPY)
```

```
## [1] "data.frame"
```
### Em python

```python
type(babyNamesPY)
```

```
## <class 'pandas.core.frame.DataFrame'>
```

```python
type(r.babyNamesR)
```

```
## <class 'pandas.core.frame.DataFrame'>
```

---

### Em python

```python
babyNamesPY.shape
```

```
## (258000, 4)
```

```python
babyNamesPY.info()
```

```
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 258000 entries, 0 to 257999
## Data columns (total 4 columns):
##  #   Column   Non-Null Count   Dtype  
## ---  ------   --------------   -----  
##  0   year     258000 non-null  int64  
##  1   name     258000 non-null  object 
##  2   percent  258000 non-null  float64
##  3   sex      258000 non-null  object 
## dtypes: float64(1), int64(1), object(2)
## memory usage: 7.9+ MB
```

---

## Selecionando colunas

```python
print(babyNamesPY.year) #ou print(babyNamesPY['year'])
```

```
## 0         1880
## 1         1880
## 2         1880
## 3         1880
## 4         1880
##           ... 
## 257995    2008
## 257996    2008
## 257997    2008
## 257998    2008
## 257999    2008
## Name: year, Length: 258000, dtype: int64
```

---

## Apply functions on DataFrames

```python
df = pd.DataFrame(
  {'A': [1, 2, 3],
   'B':[4, 5, 6]
  })
df.apply(np.mean,axis=0)
```

```
## A    2.0
## B    5.0
## dtype: float64
```

```python
df.apply(np.mean,axis=1)
```

```
## 0    2.5
## 1    3.5
## 2    4.5
## dtype: float64
```

---

Dados do SSA contêm somente os 1000 nomes mais comuns de cada ano....

```python
print(babyNamesPY.groupby(['year','sex']).name.count())
```

```
## year  sex 
## 1880  boy     1000
##       girl    1000
## 1881  boy     1000
##       girl    1000
## 1882  boy     1000
##               ... 
## 2006  girl    1000
## 2007  boy     1000
##       girl    1000
## 2008  boy     1000
##       girl    1000
## Name: name, Length: 258, dtype: int64
```
---

Alguns verbos coincidem com `dplyr`

```python
print(babyNamesPY.groupby(['year','sex']).percent.sum())
```

```
## year  sex 
## 1880  boy     0.930746
##       girl    0.934546
## 1881  boy     0.930439
##       girl    0.932690
## 1882  boy     0.927532
##                 ...   
## 2006  girl    0.684830
## 2007  boy     0.801105
##       girl    0.677453
## 2008  boy     0.795414
##       girl    0.672516
## Name: percent, Length: 258, dtype: float64
```

---

## Indexando (linhas)

Não é possível indexar diretamente um DataFrame, você precisa acessar o atributo `iloc`

```python
print(babyNamesPY.iloc[0]) # ou print(babyNamesPY.iloc[0,:])
```

```
## year           1880
## name           John
## percent    0.081541
## sex             boy
## Name: 0, dtype: object
```

```python
print(babyNamesPY.iloc[0:3])
```

```
##    year     name   percent  sex
## 0  1880     John  0.081541  boy
## 1  1880  William  0.080511  boy
## 2  1880    James  0.050057  boy
```

---

## No (significant number of) boys named Sue...

```python
print(babyNamesPY.loc[babyNamesPY.name == "Sue",])
```

```
##         year name   percent   sex
## 129189  1880  Sue  0.000666  girl
## 130185  1881  Sue  0.000678  girl
## 131171  1882  Sue  0.000726  girl
## 132216  1883  Sue  0.000566  girl
## 133194  1884  Sue  0.000669  girl
## ...      ...  ...       ...   ...
## 229543  1980  Sue  0.000193  girl
## 230654  1981  Sue  0.000152  girl
## 231777  1982  Sue  0.000116  girl
## 232885  1983  Sue  0.000096  girl
## 233984  1984  Sue  0.000082  girl
## 
## [105 rows x 4 columns]
```

---

## Indexando (linhas e colunas)

```python
print(babyNamesPY.iloc[0,0]) 
```

```
## 1880
```

```python
print(babyNamesPY.loc[0,'name']) 
```

```
## John
```

```python
print(babyNamesPY.loc[[0,10,100],['name','year']]) 
```

```
##        name  year
## 0      John  1880
## 10   Edward  1880
## 100   Perry  1880
```

```python
print(babyNamesPY.index)
```

```
## RangeIndex(start=0, stop=258000, step=1)
```

---

## Plot usando matplotlib

```python
babySue = babyNamesPY.loc[babyNamesPY.name == "Sue",]
babySue.plot(kind = 'scatter', x = 'year', y = 'percent')
plt.show() # from matplotlib
```

<img src="aula11b_files/figure-html/plotmat-1.png" width="60%" />
---

## Plot usando ggplot

```python
from plotnine import *
ggplot(babySue) + geom_point(aes(x = 'year', y = 'percent'))
```

```
## <ggplot: (-9223372036806471017)>
```

---

## Formatos Suportados pelo Pandas

**Observações**:
- `read_csv` possui o argumento `delimiter`, que pode ser ajustado para outros tipos de arquivos em texto plano retangulares;
- `read_csv` também possui o argumento `chunksize`, que pode ser usado para leitura por partes.

---

## Formatos Suportados pelo Pandas

---

## Pandas e Chunks

```python
reader = pd.read_csv("../dados/baby-names.csv",
        header = 0, chunksize=1000)
soma = 0
for df in reader:
  soma += df.loc[df.name == "Sue", 'percent'].sum()

print(soma)
```

```
## 0.10973800000000006
```

---

## SQLite, Pandas e Python

```python
import pandas as pd
import sqlite3
conn = sqlite3.connect("../dados/disco.db")
pd.read_sql_query("SELECT * FROM artists LIMIT 5", conn)
```

```
##    ArtistId               Name
## 0         1              AC/DC
## 1         2             Accept
## 2         3          Aerosmith
## 3         4  Alanis Morissette
## 4         5    Alice In Chains
```

```python
conn.close()
```

---

## MongoDB, Pandas e Python

```python
from pymongo import MongoClient
import pprint
myurl = "mongodb+srv://fernandaBD:mongo123@cluster0.2ph3s.mongodb.net"
client = MongoClient(myurl)
db = client['me315mongodb']
collection = db['diamantes']
collection
```

```
## Collection(Database(MongoClient(host=['cluster0-shard-00-01.2ph3s.mongodb.net:27017', 'cluster0-shard-00-00.2ph3s.mongodb.net:27017', 'cluster0-shard-00-02.2ph3s.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, authsource='admin', replicaset='atlas-e83e3e-shard-0', ssl=True), 'me315mongodb'), 'diamantes')
```

---

## MongoDB

```python
doc = collection.find_one()
pprint.pprint(doc)
```

```
## {'_id': ObjectId('5fd034c6e17a0000d50063aa'),
##  'carat': 0.22,
##  'clarity': 'VS2',
##  'color': 'E',
##  'cut': 'Fair',
##  'depth': 65.1,
##  'price': 337,
##  'table': 61.0,
##  'x': 3.87,
##  'y': 3.78,
##  'z': 2.49}
```

---

## MongoDB

```python
doc = collection.find_one({"cut":"Premium"})
pprint.pprint(doc)
```

## MongoDB

```python
doc = collection.find({"cut":"Premium"}).limit(5)
for x in doc:
  pprint.pprint(x,width=10)
```

```
## {'_id': ObjectId('5fd034c6e17a0000d50063ae'),
##  'carat': 0.22,
##  'clarity': 'SI1',
##  'color': 'F',
##  'cut': 'Premium',
##  'depth': 60.4,
##  'price': 342,
##  'table': 61.0,
##  'x': 3.88,
##  'y': 3.84,
##  'z': 2.33}
## {'_id': ObjectId('5fd034c6e17a0000d50063da'),
##  'carat': 0.3,
##  'clarity': 'SI2',
##  'color': 'J',
##  'cut': 'Premium',
##  'depth': 59.3,
##  'price': 405,
##  'table': 61.0,
##  'x': 4.43,
##  'y': 4.38,
##  'z': 2.61}
## {'_id': ObjectId('5fd034c6e17a0000d5006407'),
##  'carat': 0.75,
##  'clarity': 'SI1',
##  'color': 'E',
##  'cut': 'Premium',
##  'depth': 59.9,
##  'price': 2760,
##  'table': 54.0,
##  'x': 6.0,
##  'y': 5.96,
##  'z': 3.58}
## {'_id': ObjectId('5fd034c6e17a0000d5006412'),
##  'carat': 0.9,
##  'clarity': 'VS2',
##  'color': 'I',
##  'cut': 'Premium',
##  'depth': 63.0,
##  'price': 2761,
##  'table': 58.0,
##  'x': 6.16,
##  'y': 6.12,
##  'z': 3.87}
## {'_id': ObjectId('5fd034c6e17a0000d50063eb'),
##  'carat': 0.3,
##  'clarity': 'SI1',
##  'color': 'H',
##  'cut': 'Premium',
##  'depth': 62.5,
##  'price': 554,
##  'table': 57.0,
##  'x': 4.29,
##  'y': 4.25,
##  'z': 2.67}
```