当前位置：首页 > news >正文

做茶叶网站的目的和规划发布平台

news 2025/8/9 10:30:23

做茶叶网站的目的和规划,发布平台,做app 的模板下载网站,seo网站排名后退提示：此节内容仅作了解即可目录二进制数据格式使用HDF5 读取Microsoft Excel文件二进制数据格式实现数据的高效二进制格式存储最简单的办法之一是使用Python内置的pickle序列化。 Python 的 pickle 模块是一个用于序列化和反序列化 Python 对象结构的模块…

提示：此节内容仅作了解即可

二进制数据格式

使用HDF5

读取Microsoft Excel文件

二进制数据格式

实现数据的高效二进制格式存储最简单的办法之一是使用Python内置的pickle序列化。

Python 的 pickle 模块是一个用于序列化和反序列化 Python 对象结构的模块。它允许你将 Python 的任何对象转换为字节流，这样你就可以将这些对象存储在文件中，或者通过网络传输。同样，pickle 也可以用来从字节流中恢复出原来的对象。

pickle模块非常强大，它能够处理几乎所有的 Python 数据类型，包括但不限于列表、字典、类实例等。但是，使用 pickle时需要注意安全性问题，因为反序列化恶意构造的数据可能会执行不安全的代码。因此，通常不建议在不信任的数据源上使用 pickle。如果需要在不安全的环境中使用序列化，可以考虑使用 json模块，尽管 json 模块只能序列化基本的数据类型。

pandas对象都有一个用于将数据以pickle格式保存到磁盘上的to_pickle方法：

In [87]: frame = pd.read_csv('examples/ex1.csv')
In [88]: frame
Out[88]: a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [89]: frame.to_pickle('examples/frame_pickle')

你可以通过pickle直接读取被pickle化的数据，或是使用更为方便的pandas.read_pickle：

In [90]: pd.read_pickle('examples/frame_pickle')
Out[90]: a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

使用HDF5

HDF5是一种存储大规模科学数组数据的非常好的文件格式。它可以被作为C标准库，带有许多语言的接口，如Java、Python和MATLAB等。HDF5中的HDF指的是层次型数据格式（hierarchical data format）。每个HDF5文件都含有一个文件系统式的节点结构，它使你能够存储多个数据集并支持元数据。与其他简单格式相比，HDF5支持多种压缩器的即时压缩，还能更高效地存储重复模式数据。对于那些非常大的无法直接放入内存的数据集，HDF5就是不错的选择，因为它可以高效地分块读写。

虽然可以用PyTables或h5py库直接访问HDF5文件，pandas提供了更为高级的接口，可以简化存储Series和DataFrame对象。HDFStore类可以像字典一样，处理低级的细节：

In [92]: frame = pd.DataFrame({'a': np.random.randn(100)})
In [93]: store = pd.HDFStore('mydata.h5')
In [94]: store['obj1'] = frame
In [95]: store['obj1_col'] = frame['a']
In [96]: store
Out[96]: 
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1                frame        (shape->[100,1])                               
/obj1_col            series       (shape->[100])                                 
/obj2                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->
[index])
/obj3                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->
[index])

HDF5文件中的对象可以通过与字典一样的API进行获取:

In [97]: store['obj1']
Out[97]: a
0  -0.204708
1   0.478943
2  -0.519439
3  -0.555730
4   1.965781
..       ...
95  0.795253
96  0.118110
97 -0.748532
98  0.584970
99  0.152677
[100 rows x 1 columns]

HDFStore支持两种存储模式，’fixed’和’table’。后者通常会更慢，但是支持使用特殊语法进行查询操作：

In [98]: store.put('obj2', frame, format='table')
In [99]: store.select('obj2', where=['index >= 10 and index <= 15'])
Out[99]: a
10  1.007189
11 -1.296221
12  0.274992
13  0.228913
14  1.352917
15  0.886429
In [100]: store.close()

put是store[‘obj2’] = frame方法的显示版本，允许我们设置其它的选项，比如格式。

pandas.read_hdf函数可以快捷使用这些工具：

In [101]: frame.to_hdf('mydata.h5', 'obj3', format='table')
In [102]: pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
Out[102]: a
0 -0.204708
1  0.478943
2 -0.519439
3 -0.555730
4  1.965781

读取Microsoft Excel文件

pandas的ExcelFile类或pandas.read_excel函数支持读取存储在Excel 2003（或更高版本）中的表格型数据。这两个工具分别使用扩展包xlrd和openpyxl读取XLS和XLSX文件。你可以用pip或conda安装它们。

要使用ExcelFile，通过传递xls或xlsx路径创建一个实例：

In [104]: xlsx = pd.ExcelFile('examples/ex1.xlsx')

存储在表单中的数据可以read_excel读取到DataFrame：

In [105]: pd.read_excel(xlsx, 'Sheet1')
Out[105]: a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

如果要读取一个文件中的多个表单，创建ExcelFile会更快，但你也可以将文件名传递到pandas.read_excel：

In [106]: frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
In [107]: frame
Out[107]: a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

如果要将pandas数据写入为Excel格式，你必须首先创建一个ExcelWriter，然后使用pandas对象的to_excel方法将数据写入到其中：

In [108]: writer = pd.ExcelWriter('examples/ex2.xlsx')
In [109]: frame.to_excel(writer, 'Sheet1')
In [110]: writer.save()

你还可以不使用ExcelWriter，而是传递文件的路径到to_excel：

In [111]: frame.to_excel('examples/ex2.xlsx')

Web APIs交互

为了搜索最新的30个GitHub上的pandas主题，我们可以发一个HTTP GET请求，使用requests扩展库：

In [113]: import requests
In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
In [115]: resp = requests.get(url)
In [116]: resp
Out[116]: <Response [200]>

响应对象的json方法会返回一个包含被解析过的JSON字典，加载到一个Python对象中：

In [117]: data = resp.json()
In [118]: data[0]['title']
Out[118]: 'Period does not round down for frequencies less that 1 hour'

data中的每个元素都是一个包含所有GitHub主题页数据（不包含评论）的字典。我们可以直接传递数据到DataFrame，并提取感兴趣的字段：

In [119]: issues = pd.DataFrame(data, columns=['number', 'title',.....:                                      'labels', 'state'])
In [120]: issues
Out[120]:number                                              title  \
0    17666  Period does not round down for frequencies les...   
1    17665           DOC: improve docstring of function where   
2    17664               COMPAT: skip 32-bit test on int repr   
3    17662                          implement Delegator class
4    17654  BUG: Fix series rename called with str alterin...   
..     ...                                                ...   
25   17603  BUG: Correctly localize naive datetime strings...   
26   17599                     core.dtypes.generic --> cython   
27   17596   Merge cdate_range functionality into bdate_range   
28   17587  Time Grouper bug fix when applied for list gro...   
29   17583  BUG: fix tz-aware DatetimeIndex + TimedeltaInd...   labels state  
0                                                  []  open  
1   [{'id': 134699, 'url': 'https://api.github.com...  open  
2   [{'id': 563047854, 'url': 'https://api.github....  open  
3                                                  []  open  
4   [{'id': 76811, 'url': 'https://api.github.com/...  open  
..                                                ...   ...  
25  [{'id': 76811, 'url': 'https://api.github.com/...  open  
26  [{'id': 49094459, 'url': 'https://api.github.c...  open  
27  [{'id': 35818298, 'url': 'https://api.github.c...  open  
28  [{'id': 233160, 'url': 'https://api.github.com...  open  
29  [{'id': 76811, 'url': 'https://api.github.com/...  open  
[30 rows x 4 columns]

查看全文

http://www.mmbaike.com/news/92796.html